Introduction¶

This project aims to build robust, interpretable machine learning models to distinguish hypoxic from normoxic states in cancer cell lines based on high-dimensional gene expression data. We focus on two cell lines — HCC1806 and MCF7 — profiled using two RNA sequencing methods: Smart-seq, a high-sensitivity, full-length transcript approach, and Drop-seq, a high-throughput, lower-sensitivity alternative.

Our pipeline includes extensive preprocessing and downstream analysis of normalized expression matrices.

Unsupervised methods are used to uncover intrinsic structure, including clustering (Hierarchical, Leiden, k-means) and dimensionality reduction (PCA, t-SNE, UMAP) to identify patterns across oxygen conditions.

Supervised models — logistic regression, SVMs, random forests, and MLPs — are trained to classify samples by oxygen state and identify key features driving hypoxic responses.

Dataset Naming Convention¶

To keep datasets organized, we use the format:

Components:¶

  • platform:

    • ss = Smart-seq
    • ds = Drop-seq
  • cell:

    • mcf7
    • hcc (for HCC1806)
  • stage:

    • raw = unfiltered
    • filt = filtered
    • norm = filtered + normalized

Examples:¶

Description Variable Name
Smart-seq unfiltered MCF7 ss_mcf7_raw
Smart-seq filtered MCF7 ss_mcf7_filt
Smart-seq filtered + normalized MCF7 ss_mcf7_norm
Smart-seq unfiltered HCC1806 ss_hcc_raw
Smart-seq filtered + normalized HCC1806 ss_hcc_norm
Drop-seq filtered MCF7 ds_mcf7_filt
Drop-seq filtered + normalized MCF7 ds_mcf7_norm
Drop-seq filtered HCC1806 ds_hcc_filt
Drop-seq filtered + normalized HCC1806 ds_hcc_norm

Imports¶

In [1]:
# Standard library
import math
from itertools import combinations
from types import ModuleType
from typing import Any, Callable

# Third-party libraries
import anndata
import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.io as pio
import scanpy as sc
import seaborn as sns

# SciPy
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.stats import kurtosis, mode, skew

# Matplotlib
from matplotlib.patches import Patch
from matplotlib.ticker import FixedLocator, FixedFormatter

# scikit-learn
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_selection import RFECV, SelectFromModel, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import trustworthiness
from sklearn.metrics import (
    accuracy_score,
    adjusted_rand_score,
    classification_report,
    confusion_matrix,
    normalized_mutual_info_score,
    silhouette_samples,
    silhouette_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    learning_curve,
    train_test_split,
)
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import LinearSVC, SVC
In [2]:
import warnings
warnings.resetwarnings()
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category = ConvergenceWarning)

Meta Data¶

In [3]:
# META DATA
ss_mcf7_meta = pd.read_csv("AILab2025/SmartSeq/MCF7_SmartS_MetaData.tsv",delimiter="\t",engine='python',index_col=0)
ss_mcf7_meta.head(5)
Out[3]:
Cell Line Lane Pos Condition Hours Cell name PreprocessingTag ProcessingComments
Filename
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam MCF7 output.STAR.1 A10 Hypo 72 S28 Aligned.sortedByCoord.out.bam STAR,FeatureCounts
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam MCF7 output.STAR.1 A11 Hypo 72 S29 Aligned.sortedByCoord.out.bam STAR,FeatureCounts
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam MCF7 output.STAR.1 A12 Hypo 72 S30 Aligned.sortedByCoord.out.bam STAR,FeatureCounts
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam MCF7 output.STAR.1 A1 Norm 72 S1 Aligned.sortedByCoord.out.bam STAR,FeatureCounts
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam MCF7 output.STAR.1 A2 Norm 72 S2 Aligned.sortedByCoord.out.bam STAR,FeatureCounts

Here we see what information the name of each cell contains, which will be useful later - especially the condition (Hypo/Norm) which is what we ultimately want to predict.

Unfiltered SmartSeq MCF7¶

Exploration¶

In this initial exploration step, we load the unfiltered Smart-Seq file for the MCF7 cell line and examine its dimensions and gene identifiers, as well as inspect basic data quality metrics. Specifically, we:

  • read in the raw counts table (genes × cells)
  • print the overall shape to see how many genes and cells we have
  • see the first few rows to verify the per-cell expression values
  • use .describe() to summarize distributions across cells
  • check for any missing values

This quick scan gives us confidence that the data are loaded correctly and sets the stage for filtering, normalization, and more detailed analysis.

In [4]:
ss_mcf7_raw = pd.read_csv("AILab2025/SmartSeq/MCF7_SmartS_Unfiltered_Data.txt",delimiter=" ",engine='python',index_col=0)
gene_symbls = ss_mcf7_raw.index
print("Dataframe indexes: ", gene_symbls)
ss_mcf7_raw.shape
Dataframe indexes:  Index(['WASH7P', 'MIR6859-1', 'WASH9P', 'OR4F29', 'MTND1P23', 'MTND2P28',
       'MTCO1P12', 'MTCO2P12', 'MTATP8P1', 'MTATP6P1',
       ...
       'MT-TH', 'MT-TS2', 'MT-TL2', 'MT-ND5', 'MT-ND6', 'MT-TE', 'MT-CYB',
       'MT-TT', 'MT-TP', 'MAFIP'],
      dtype='object', length=22934)
Out[4]:
(22934, 383)
In [5]:
# How much of each gene (row) is in each cell (column)
ss_mcf7_raw.head(5)
Out[5]:
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam ... output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam
WASH7P 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 1 0 1
MIR6859-1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
WASH9P 1 0 0 0 0 1 10 1 0 0 ... 1 1 0 0 0 0 1 1 4 5
OR4F29 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
MTND1P23 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 383 columns

In [6]:
ss_mcf7_raw.describe()
Out[6]:
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam ... output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam
count 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 ... 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000 22934.000000
mean 40.817651 0.012253 86.442400 1.024636 14.531351 56.213613 75.397183 62.767725 67.396747 2.240734 ... 17.362562 42.080230 34.692422 32.735284 21.992718 17.439391 49.242784 61.545609 68.289352 62.851400
std 465.709940 0.207726 1036.572689 6.097362 123.800530 503.599145 430.471519 520.167576 459.689019 25.449630 ... 193.153757 256.775704 679.960908 300.291051 153.441647 198.179666 359.337479 540.847355 636.892085 785.670341
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 17.000000 0.000000 5.000000 0.000000 7.000000 23.000000 39.000000 35.000000 38.000000 1.000000 ... 9.000000 30.000000 0.000000 17.000000 12.000000 9.000000 27.000000 30.000000 38.000000 33.000000
max 46744.000000 14.000000 82047.000000 289.000000 10582.000000 46856.000000 29534.000000 50972.000000 36236.000000 1707.000000 ... 17800.000000 23355.000000 81952.000000 29540.000000 12149.000000 19285.000000 28021.000000 40708.000000 46261.000000 68790.000000

8 rows × 383 columns

In [7]:
# MISSING VALUES
ss_mcf7_raw.isnull().values.any()
Out[7]:
np.False_

Gene Counts¶

In this section we:

  • add up all the reads in each cell to see how many genes we detect per sample
  • make a bar chart (colored by hypoxia vs. normoxia) to spot any differences
  • group samples by their ID letters and count how many hypoxic and normoxic cells are in each group

This helps us check if one condition has consistently more or fewer detected genes before we move on.

In [8]:
ss_mcf7_raw_small = ss_mcf7_raw.iloc[:, 150:220]
column_sums = ss_mcf7_raw_small.sum(axis=0)
column_sums_sorted = column_sums.sort_values(ascending=False)

sorted_labels = column_sums_sorted.index

clean_labels = sorted_labels.str.replace(r"output\.STAR\.", "", regex=True)
clean_labels = clean_labels.str.replace(r"_Aligned\.sortedByCoord\.out\.bam", "", regex=True)

colors = [
    'royalblue' if 'Hypo' in label else
    'seagreen' if 'Norm' in label else
    'gray'
    for label in clean_labels
]

plt.figure(figsize=(14,8))
plt.bar(clean_labels, column_sums_sorted.values, color=colors)
plt.xticks(rotation=90, fontsize=8)
plt.title('Total Number of Genes per Cell Type')
plt.xlabel('Cell Type')
plt.ylabel('Total Gene Count')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [9]:
column_sums = ss_mcf7_raw.sum(axis=0)
column_sums_sorted = column_sums.sort_values(ascending=False)

sorted_labels = column_sums_sorted.index

clean_labels = sorted_labels.str.replace(r"output\.STAR\.", "", regex=True)
clean_labels = clean_labels.str.replace(r"_Aligned\.sortedByCoord\.out\.bam", "", regex=True)

# Extract letter from label (after first underscore)
first_letters = clean_labels.str.extract(r'_(\w)')[0]
print(first_letters)

from collections import defaultdict

# Initialize counters
group_counts = defaultdict(lambda: {'Hypo': 0, 'Norm': 0})

for idx, name in enumerate(clean_labels):
    # Extract the letter after first underscore
    letter = first_letters[idx]

    # Check condition
    if 'hypo' in name.lower():
        group_counts[letter]['Hypo'] += 1
    elif 'norm' in name.lower():
        group_counts[letter]['Norm'] += 1

print(group_counts)
0      C
1      B
2      C
3      A
4      A
      ..
378    E
379    H
380    D
381    G
382    H
Name: 0, Length: 383, dtype: object
defaultdict(<function <lambda> at 0x2bbcb47c0>, {'C': {'Hypo': 24, 'Norm': 24}, 'B': {'Hypo': 24, 'Norm': 24}, 'A': {'Hypo': 24, 'Norm': 24}, 'E': {'Hypo': 24, 'Norm': 24}, 'D': {'Hypo': 24, 'Norm': 24}, 'F': {'Hypo': 24, 'Norm': 24}, 'G': {'Hypo': 24, 'Norm': 24}, 'H': {'Hypo': 23, 'Norm': 24}})

We conclude that classes are balanced and this is also true across the groups (ID letters).

Outliers¶

In [10]:
Q1 = ss_mcf7_raw.quantile(0.25)
Q3 = ss_mcf7_raw.quantile(0.75)
IQR = Q3 - Q1

# Keep only the rows that have no ouliers
ss_mcf7_raw_noOut = ss_mcf7_raw[~((ss_mcf7_raw < (Q1 - 1.5 * IQR)) | (ss_mcf7_raw > (Q3 + 1.5 * IQR))).any(axis=1)]

ss_mcf7_raw_noOut.shape
Out[10]:
(6435, 383)

IQR method removes 22934 - 6435 = 16,499 rows, which is roughly 72% of our data => not valid, our data is too sparse for this approach.

Quality Control & Violin Plots¶

In this section, we:

  • Compute per-cell QC metrics:
    • total counts
    • number of genes detected
    • percent mitochondrial reads (MT-genes as a fraction of total)
    • percent zeros (dropouts)
  • visualize distributions with histograms and violin plots to spot outliers or skewed distributions
  • filter out low-quality cells using intuitive thresholds (e.g., <2,000 genes, <100,000 reads, >10% mitochondrial), then re-plot the post-filter distributions to confirm that most remaining cells lie within acceptable ranges
In [11]:
# Create QC DataFrame
qc_ss_mcf7 = pd.DataFrame(index=ss_mcf7_raw.columns)

# Total counts
qc_ss_mcf7['total_counts'] = ss_mcf7_raw.sum(axis=0)
print("\nComputed total_counts per cell.")
print(qc_ss_mcf7['total_counts'].describe())

# Number of genes detected per cell
qc_ss_mcf7['n_genes'] = (ss_mcf7_raw > 0).sum(axis=0)
print("\nComputed n_genes per cell.")
print(qc_ss_mcf7['n_genes'].describe())

# Mitochondrial genes
mito_genes = [gene for gene in ss_mcf7_raw.index if gene.startswith("MT-") or gene.startswith("MT.")]
print(f"\nIdentified {len(mito_genes)} mitochondrial genes.")

# % Mitochondrial expression
qc_ss_mcf7['pct_mito'] = ss_mcf7_raw.loc[mito_genes].sum(axis=0) / qc_ss_mcf7['total_counts'] * 100
print("\nComputed percent mitochondrial gene expression per cell.")
print(qc_ss_mcf7['pct_mito'].describe())

# Percentage of Zeros per Sample
qc_ss_mcf7['percent_zeros'] = (ss_mcf7_raw == 0).sum(axis=0) / ss_mcf7_raw.shape[0] * 100
Computed total_counts per cell.
count    3.830000e+02
mean     9.946119e+05
std      5.503732e+05
min      1.000000e+00
25%      5.987505e+05
50%      1.129334e+06
75%      1.408638e+06
max      2.308057e+06
Name: total_counts, dtype: float64

Computed n_genes per cell.
count      383.000000
mean      9124.219321
std       2693.309249
min          1.000000
25%       8456.500000
50%       9907.000000
75%      10789.000000
max      12519.000000
Name: n_genes, dtype: float64

Identified 36 mitochondrial genes.

Computed percent mitochondrial gene expression per cell.
count    383.000000
mean       1.911659
std        2.355400
min        0.000000
25%        0.740893
50%        1.528072
75%        2.597771
max       31.033833
Name: pct_mito, dtype: float64
In [12]:
fig, axs = plt.subplots(1, 3, figsize=(15, 4))

axs[0].hist(qc_ss_mcf7['total_counts'], bins=30, color='gray')
axs[0].set_title("Total Counts per Sample")
axs[0].set_xlabel("Total Counts")

axs[1].hist(qc_ss_mcf7['n_genes'], bins=30, color='steelblue')
axs[1].set_title("Number of Genes per Sample")
axs[1].set_xlabel("Genes Detected")

axs[2].hist(qc_ss_mcf7['percent_zeros'], bins=30, color='darkred')
axs[2].set_title("% Zeros per Sample")
axs[2].set_xlabel("Percent Zeros")

plt.tight_layout()
plt.show()
No description has been provided for this image

We see a long tail of low-count cells — those below ~100,000 reads will be removed.

In [13]:
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

sns.violinplot(y=qc_ss_mcf7['total_counts'], ax=axes[0])
axes[0].set_title("Total Counts per Cell")

sns.violinplot(y=qc_ss_mcf7['n_genes'], ax=axes[1])
axes[1].set_title("Number of Genes per Cell")

sns.violinplot(y=qc_ss_mcf7['pct_mito'], ax=axes[2])
axes[2].set_title("Mitochondrial Gene %")

plt.tight_layout()
plt.show()
No description has been provided for this image

Most cells cluster around 5,000–10,000 detected genes, but a few drop below 2,000.

In [14]:
# QC Scatter Plot
adata = ad.AnnData(X=ss_mcf7_raw.T)

adata.obs['total_counts'] = qc_ss_mcf7['total_counts']
adata.obs['n_genes_by_counts'] = qc_ss_mcf7['n_genes']
adata.obs['pct_counts_mt'] = qc_ss_mcf7['pct_mito']

sc.pl.scatter(
    adata,
    x="total_counts",
    y="n_genes_by_counts",
    color="pct_counts_mt"
)
No description has been provided for this image

This scatter plot visualizes key quality metrics for each cell:

  • X-axis: Total number of transcripts detected per cell (total_counts)
  • Y-axis: Number of unique genes detected per cell (n_genes_by_counts)
  • Color: Proportion of reads mapping to mitochondrial genes (pct_counts_mt), a known marker of cell stress or apoptosis.

Most cells show a healthy profile with:

  • High gene detection
  • Moderate transcript counts
  • Low mitochondrial content (dark colors)

However, a few outliers have:

  • Low gene counts
  • High mitochondrial percentages (bright yellow points)

These may represent low-quality or dying cells and are typically filtered out in preprocessing to improve downstream analyses.

High mitochondrial gene expression in a cell usually indicates poor quality, often because the cell was:

  • Stressed
  • Dying or partially lysed
  • Degraded

Now we filter the data:

In [15]:
min_genes = 2_000  # Cells with very low gene counts (< 2000) should be filtered out
min_counts = 100_000  # Cells with extremely low counts may be low-quality
max_mito = 10  # A common threshold is 5%-10% to flag high-mito cells

high_quality_cells = qc_ss_mcf7[
    (qc_ss_mcf7['n_genes'] > min_genes) &
    (qc_ss_mcf7['total_counts'] > min_counts) &
    (qc_ss_mcf7['pct_mito'] < max_mito)
]

# Retain only the high-quality columns (cells)
ss_mcf7_raw_filt = ss_mcf7_raw[high_quality_cells.index]

print(f"Original: {ss_mcf7_raw.shape[1]} cells")
print(f"Filtered: {ss_mcf7_raw_filt.shape[1]} cells")
Original: 383 cells
Filtered: 337 cells

Filtering removed 46 cells. Let's see how the violin and QC scatter plots look now.

In [16]:
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

sns.violinplot(y=high_quality_cells['total_counts'], ax=axes[0])
axes[0].set_title("Total Counts per Cell")

sns.violinplot(y=high_quality_cells['n_genes'], ax=axes[1])
axes[1].set_title("Number of Genes per Cell")

sns.violinplot(y=high_quality_cells['pct_mito'], ax=axes[2])
axes[2].set_title("Mitochondrial Gene %")

plt.tight_layout()
plt.show()
No description has been provided for this image

After filtering, the distributions tighten, indicating that outlier cells have been successfully removed.

In [17]:
adata = ad.AnnData(X=ss_mcf7_raw_filt.T)

adata.obs['total_counts'] = high_quality_cells['total_counts']
adata.obs['n_genes_by_counts'] = high_quality_cells['n_genes']
adata.obs['pct_counts_mt'] = high_quality_cells['pct_mito']

sc.pl.violin(
    adata,
    ["total_counts", "n_genes_by_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True
)
No description has been provided for this image
In [18]:
sc.pl.scatter(
    adata,
    x="total_counts",
    y="n_genes_by_counts",
    color="pct_counts_mt"
)
No description has been provided for this image

The scatter plot confirms that we have removed cells with extremely low gene counts, low total counts and high mitochondrial content (the heatmap on the rigt is on a much lower scale).

Duplicates¶

First, we remove genes (rows) that have zero expression across all cells. These genes contain no information and contribute neither to biological signal nor technical variation. Keeping them would only increase dimensionality and computational load without adding value.

In [19]:
# Check original number of genes (rows)
original_rows = ss_mcf7_raw_filt.shape[0]

# Drop genes with all-zero expression
ss_mcf7_raw_filt = ss_mcf7_raw_filt.loc[~(ss_mcf7_raw_filt == 0).all(axis=1)]

# Check number of rows after dropping
remaining_rows = ss_mcf7_raw_filt.shape[0]

# Compute how many were dropped
dropped_rows = original_rows - remaining_rows
print(f"Number of all-zero rows dropped: {dropped_rows}")
Number of all-zero rows dropped: 36

Next, we remove duplicate genes — that is, genes that have identical expression profiles across all cells.

This can happen due to:

  • redundant gene IDs
  • dummy genes or technical artifacts
  • perfectly zeroed-out rows (common in sparse data)

We drop all but the first occurrence of each set of duplicate rows:

In [20]:
duplicate_rows = ss_mcf7_raw_filt[ss_mcf7_raw_filt.duplicated(keep=False)]
print("number of duplicate rows: ", duplicate_rows.shape[0])
print("Rows before:", ss_mcf7_raw_filt.shape[0])
ss_mcf7_raw_filt = ss_mcf7_raw_filt.drop_duplicates()
print("Rows after :", ss_mcf7_raw_filt.shape[0])
number of duplicate rows:  98
Rows before: 22898
Rows after : 22843

We do a quick check to make sure that there are no cells (columns) with zero expression across all genes:

In [21]:
zero_cols = (ss_mcf7_raw_filt == 0).all(axis=0)
print(f"Number of all-zero columns: {zero_cols.sum()}")
Number of all-zero columns: 0

Skeweness & Kurtosis¶

In this “Skewness and Kurtosis” step, we check how lopsided and heavy-tailed our per-cell expression profiles are, both before and after a simple log2 transformation:

  • skewness tells us if a cell’s expression values lean more to one side (positive skew means a long right tail; negative skew means a long left tail)
  • kurtosis measures how heavy those tails are (high kurtosis means more extreme outliers)
In [22]:
from scipy.stats import kurtosis, skew
colN = np.shape(ss_mcf7_raw_filt)[1]
colN
df_skew_cells = []
cnames = ss_mcf7_raw_filt.columns

for i in range(colN) :     
     v_df = ss_mcf7_raw_filt[cnames[i]]
     df_skew_cells += [skew(v_df)]   
  #  df_skew_cells += [df[cnames[i]].skew()]
df_skew_cells
sns.histplot(df_skew_cells,bins=100)
plt.xlabel('Skewness of single cells expression profiles - ss_mcf7_raw_filt')
Out[22]:
Text(0.5, 0, 'Skewness of single cells expression profiles - ss_mcf7_raw_filt')
No description has been provided for this image

Here we see that most cells have skewness values between 40 and 70, with a peak around 50–60. This indicates that the expression distributions are strongly right-skewed, which is expected in single-cell RNA-seq data due to many low-expression genes and a few highly expressed ones. The consistent skewness across cells reflects the sparse nature of the data, though extremely high or low skewness values may indicate outliers or technical artifacts.

In [23]:
df_kurt_cells = []
for i in range(colN) :     
     v_df = ss_mcf7_raw_filt[cnames[i]]
     df_kurt_cells += [kurtosis(v_df)]   
df_kurt_cells
sns.histplot(df_kurt_cells,bins=100)
plt.xlabel('Kurtosis of single cells expression profiles - ss_mcf7_raw_filt')
Out[23]:
Text(0.5, 0, 'Kurtosis of single cells expression profiles - ss_mcf7_raw_filt')
No description has been provided for this image
  • The kurtosis distribution is right-skewed, with a long tail toward higher kurtosis values.
  • Most cells fall within the 2,000–6,000 kurtosis range.
  • A few cells show extremely high kurtosis (>10,000), which are potential outliers.
  • The distribution is highly non-normal.

To reduce skew and heavy tails, we apply a log2(x+1) transform.

In [24]:
# DATA TRANSFORMATION
ss_mcf7_raw_filt_log = np.log2(ss_mcf7_raw_filt + 1)  # genes × cells
ss_mcf7_raw_filt_T = ss_mcf7_raw_filt.T
ss_mcf7_raw_filt_T_log = np.log2(ss_mcf7_raw_filt_T + 1)  # cells × genes (transpose necessary for skew() and kurtosis())
In [25]:
# Skeweness and Kurtosis should be fixed now
print("Before data transformation:", skew(ss_mcf7_raw_filt.T.values.flatten()), kurtosis(ss_mcf7_raw_filt.T.values.flatten()))
print("After data transformation:", skew(ss_mcf7_raw_filt_T_log.values.flatten()), kurtosis(ss_mcf7_raw_filt_T_log.values.flatten()))
Before data transformation: 85.61749272454507 11944.74886564649
After data transformation: 0.9824438044182301 -0.3223454127671732
In [26]:
colN = ss_mcf7_raw_filt_log.shape[1]
cnames = ss_mcf7_raw_filt_log.columns

# Compute skeweness for each cell
df_skew_cells_log = []
for i in range(colN):
    v_df = ss_mcf7_raw_filt_log[cnames[i]]
    df_skew_cells_log.append(skew(v_df))

# Plot histogram
sns.histplot(df_skew_cells_log, bins=100)
plt.xlabel('Skewness of single cells expression profiles (log2 transformed)')
plt.title('Distribution of Skewness (log2-transformed MCF7)')
plt.tight_layout()
plt.show()
No description has been provided for this image

Post‐transform, the skewness distribution tightens around zero, indicating a more symmetric profile.

In [27]:
# Compute kurtosis for each cell
df_kurt_cells_log = []
for i in range(colN):     
    v_df = ss_mcf7_raw_filt_log[cnames[i]]
    df_kurt_cells_log.append(kurtosis(v_df))

# Plot histogram of kurtosis
sns.histplot(df_kurt_cells_log, bins=100)
plt.xlabel('Kurtosis of single cells expression profiles (log2-transformed)')
plt.title('Distribution of Kurtosis - log2(ss_mcf7_raw_filt + 1)')
plt.tight_layout()
plt.show()
No description has been provided for this image

And the kurtosis drops towards a normal range, meaning fewer extreme outliers remain.

Train a linear classifier in PCA space¶

In this section, we use Principal Component Analysis (PCA) to compress our filtered, log-transformed MCF7 expression profiles into their top axes of variation, then train a simple logistic-regression model directly on those axes to distinguish hypoxic from normoxic samples. By projecting into 2D and 3D PCA space, we can visually assess how well the two conditions separate, and by fitting a linear classifier we quantify how much of that separation is captured by a single decision plane.

In [28]:
# 1. Transpose the DataFrame so that rows = samples, columns = genes
ss_mcf7_raw_filt_T = ss_mcf7_raw_filt.T  # transpose necessary for pca.fit_transform()

# 2. Log-transform if data isn't already normalized
ss_mcf7_raw_filt_T_log = np.log2(ss_mcf7_raw_filt_T + 1)
In [29]:
# 3. Generate colors based on column/sample names
colors = ['royalblue' if 'hypo' in name.lower() else 'seagreen' for name in ss_mcf7_raw_filt_T_log.index]

# 4. Run PCA with 2 components
pca = PCA(n_components=2)
pc = pca.fit_transform(ss_mcf7_raw_filt_T_log)

# 5. Plot
plt.figure(figsize=(8,6))
plt.scatter(pc[:,0], pc[:,1], c=colors)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.title("PCA of Samples (Colored by Condition)")
plt.grid(True)

legend_elements = [
    Patch(facecolor='seagreen', label='Normoxia'),
    Patch(facecolor='royalblue', label='Hypoxia')
]
plt.legend(handles=legend_elements, title='Condition')

plt.tight_layout()
plt.show()
No description has been provided for this image

The 2D projection shows that hypoxic (blue) and normoxic (green) samples form distinct clusters along PC1 and PC2.

In [30]:
# Run PCA with 3 components
pca = PCA(n_components=3)
pc = pca.fit_transform(ss_mcf7_raw_filt_T_log)
In [31]:
pio.renderers.default = 'browser'

# Extract labels from sample names
labels = ['Hypo' if 'hypo' in name.lower() else 'Norm' for name in ss_mcf7_raw_filt_T_log.index]

# Get indices for each condition
hypo_idx = np.array(labels) == 'Hypo'
norm_idx = np.array(labels) == 'Norm'

fig = go.Figure()

# Hypo samples
fig.add_trace(go.Scatter3d(
    x=pc[hypo_idx, 0],
    y=pc[hypo_idx, 1],
    z=pc[hypo_idx, 2],
    mode='markers',
    name='Hypo',
    marker=dict(color='royalblue', size=6),
    text=ss_mcf7_raw_filt_T_log.index[hypo_idx],
    hoverinfo='text'
))

# Norm samples
fig.add_trace(go.Scatter3d(
    x=pc[norm_idx, 0],
    y=pc[norm_idx, 1],
    z=pc[norm_idx, 2],
    mode='markers',
    name='Norm',
    marker=dict(color='seagreen', size=6),
    text=ss_mcf7_raw_filt_T_log.index[norm_idx],
    hoverinfo='text'
))

# Set layout
fig.update_layout(
    scene=dict(
        xaxis_title=f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)',
        yaxis_title=f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)',
        zaxis_title=f'PC3 ({pca.explained_variance_ratio_[2]*100:.1f}%)'
    ),
    title='3D PCA of Samples (Interactive)',
    margin=dict(l=0, r=0, b=0, t=40)
)

fig.show()

This is an interacive plot opening on the browser. In 3D PCA space, the two conditions appear even more separable — this suggests the linear boundary may achieve high classification accuracy. Next we fit a logistic regression on the three PCA coordinates to find the optimal separating plane between hypoxic and normoxic samples.

In [32]:
labels = [1 if 'hypo' in name.lower() else 0 for name in ss_mcf7_raw_filt_T_log.index]

clf = LogisticRegression()
clf.fit(pc, labels)

# Extract coefficients (normal vector to the plane)
w = clf.coef_[0]  # [w1, w2, w3]
b = clf.intercept_[0]

# Create grid to cover the PCA space
x_range = np.linspace(pc[:, 0].min(), pc[:, 0].max(), 10)
y_range = np.linspace(pc[:, 1].min(), pc[:, 1].max(), 10)
xx, yy = np.meshgrid(x_range, y_range)

# Compute corresponding z for the plane
zz = (-w[0] * xx - w[1] * yy - b) / w[2]

# Compute decision values for all points
decision_values = np.dot(pc, w) + b

# Predicted labels: 1 if value > 0 (Hypo), else 0 (Norm)
predicted_labels = (decision_values > 0).astype(int)

The coefficient vector ‘w‘ and intercept ‘b‘ define our decision plane in PCA space.

In [33]:
labels = np.array([1 if 'hypo' in name.lower() else 0 for name in ss_mcf7_raw_filt_T_log.index])

accuracy = (predicted_labels == labels).mean()
print(f"Accuracy of plane: {accuracy * 100:.2f}%")
Accuracy of plane: 99.41%

Result: Our linear classifier achieves ~99.41% accuracy, confirming that PCA projection retains the key signal distinguishing hypoxia from normoxia. This suggests that the Smart-seq MCF7 data is highly separable and well-suited for supervised learning.

Unfiltered SmartSeq HCC1806¶

Exploration¶

In this first step for the HCC1806 line, similarly to previous cell line, we:

  • load the unfiltered Smart-Seq expression matrix and grab the gene symbols as row labels
  • check the overall shape to see how many genes and samples we have
  • view the first few rows to confirm the per-cell read counts look sensible
  • summarize basic statistics with .describe() to inspect ranges, means, and quartiles
  • verify there are no missing values that could interfere with our analysis

This quick check ensures that the HCC1806 data are correctly loaded and free of major issues before we move on to filtering, normalization, and deeper quality control.

In [34]:
ss_hcc_raw = pd.read_csv("AILab2025/SmartSeq/HCC1806_SmartS_Unfiltered_Data.txt",delimiter=" ",engine='python',index_col=0)
gene_symbls = ss_hcc_raw.index
print("Dataframe indexes: ", gene_symbls)
ss_hcc_raw.shape
Dataframe indexes:  Index(['WASH7P', 'CICP27', 'DDX11L17', 'WASH9P', 'OR4F29', 'MTND1P23',
       'MTND2P28', 'MTCO1P12', 'MTCO2P12', 'MTATP8P1',
       ...
       'MT-TH', 'MT-TS2', 'MT-TL2', 'MT-ND5', 'MT-ND6', 'MT-TE', 'MT-CYB',
       'MT-TT', 'MT-TP', 'MAFIP'],
      dtype='object', length=23396)
Out[34]:
(23396, 243)
In [35]:
ss_hcc_raw.head(5)
Out[35]:
output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam ... output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam
WASH7P 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CICP27 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
DDX11L17 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
WASH9P 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 1 0 1 0 0
OR4F29 2 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 243 columns

In [36]:
ss_hcc_raw.describe()
Out[36]:
output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam ... output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam
count 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 ... 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000 23396.000000
mean 99.565695 207.678278 9.694734 150.689007 35.700504 47.088434 152.799453 135.869422 38.363908 45.512139 ... 76.361771 105.566593 54.026116 29.763806 28.905411 104.740725 35.181569 108.197940 37.279962 76.303855
std 529.532443 981.107905 65.546050 976.936548 205.885369 545.367706 864.974182 870.729740 265.062493 366.704721 ... 346.659348 536.881574 344.068304 186.721266 135.474736 444.773045 170.872090 589.082268 181.398951 369.090274
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.000000 0.000000 0.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 1.000000
75% 51.000000 125.000000 5.000000 40.000000 22.000000 17.000000 81.000000 76.000000 22.000000 18.000000 ... 56.000000 67.000000 29.000000 18.000000 19.000000 76.000000 24.000000 68.000000 22.000000 44.000000
max 35477.000000 69068.000000 6351.000000 70206.000000 17326.000000 47442.000000 43081.000000 62813.000000 30240.000000 35450.000000 ... 19629.000000 30987.000000 21894.000000 13457.000000 11488.000000 33462.000000 15403.000000 34478.000000 10921.000000 28532.000000

8 rows × 243 columns

In [37]:
# MISSING VALUES
ss_hcc_raw.isnull().values.any()
Out[37]:
np.False_

Gene Counts¶

In the “Gene Counts” step for HCC1806, we:

  • sum the read counts in each cell
  • clean up the sample names and color-code them by condition (hypoxia vs. normoxia)
  • plot a bar chart of total gene counts per cell, so we can quickly spot if one condition systematically yields more or fewer detected genes
In [38]:
ss_hcc_raw_small = ss_hcc_raw.iloc[:, 150:220]
column_sums = ss_hcc_raw_small.sum(axis=0)
column_sums_sorted = column_sums.sort_values(ascending=False)

sorted_labels = column_sums_sorted.index

clean_labels = sorted_labels.str.replace(r"output\.STAR\.", "", regex=True)
clean_labels = clean_labels.str.replace(r"_Aligned\.sortedByCoord\.out\.bam", "", regex=True)

colors = [
    'royalblue' if 'Hypo' in label else
    'seagreen' if 'Norm' in label else
    'gray'
    for label in clean_labels
]

plt.figure(figsize=(14,8))
plt.bar(clean_labels, column_sums_sorted.values, color=colors)
plt.xticks(rotation=90, fontsize=8)
plt.title('Total Number of Genes per Cell Type')
plt.xlabel('Cell Type')
plt.ylabel('Total Gene Count')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()
No description has been provided for this image

A handful of samples show zero counts across all genes — these empty profiles should be filtered out as they won’t contribute any useful information.

Outliers¶

In [39]:
Q1 = ss_hcc_raw.quantile(0.25)
Q3 = ss_hcc_raw.quantile(0.75)
IQR = Q3 - Q1

# Keep only the rows that have no ouliers
ss_hcc_raw_noOut = ss_hcc_raw[~((ss_hcc_raw < (Q1 - 1.5 * IQR)) | (ss_hcc_raw > (Q3 + 1.5 * IQR))).any(axis=1)]

ss_hcc_raw_noOut.shape
Out[39]:
(10815, 243)

IQR method removes 23396 - 10815 = 12581 rows, which is roughly 54% of our data => not valid, our data is too sparse for this approach

Quality Control & Violin Plots¶

In this section, we calculate and visualize key QC metrics for the HCC1806 single-cell data to identify and remove low-quality cells before further analysis. We:

  • compute per-cell metrics: total read counts, number of genes detected, percent mitochondrial reads, and percent zeros
  • plot distributions with histograms to flag outliers, then violin plots to compare density and spread across cells
  • filter cells using intuitive thresholds (e.g. >2,000 genes, >100,000 reads, <10% mito), and re-plot the post-filter metrics
In [40]:
# Create QC DataFrame
qc_ss_hcc = pd.DataFrame(index=ss_hcc_raw.columns)

# Total counts
qc_ss_hcc['total_counts'] = ss_hcc_raw.sum(axis=0)
print("\nComputed total_counts per cell.")
print(qc_ss_hcc['total_counts'].describe())

# Number of genes detected per cell
qc_ss_hcc['n_genes'] = (ss_hcc_raw > 0).sum(axis=0)
print("\nComputed n_genes per cell.")
print(qc_ss_hcc['n_genes'].describe())

# Mitochondrial genes
mito_genes = [gene for gene in ss_hcc_raw.index if gene.startswith("MT-") or gene.startswith("MT.")]
print(f"\nIdentified {len(mito_genes)} mitochondrial genes.")

# % Mitochondrial expression
qc_ss_hcc['pct_mito'] = ss_hcc_raw.loc[mito_genes].sum(axis=0) / qc_ss_hcc['total_counts'] * 100
print("\nComputed percent mitochondrial gene expression per cell.")
print(qc_ss_hcc['pct_mito'].describe())

# Percentage of Zeros per Sample
qc_ss_hcc['percent_zeros'] = (ss_hcc_raw == 0).sum(axis=0) / ss_hcc_raw.shape[0] * 100
Computed total_counts per cell.
count    2.430000e+02
mean     2.012306e+06
std      1.171858e+06
min      1.140000e+02
25%      9.910625e+05
50%      2.067645e+06
75%      2.925182e+06
max      5.758132e+06
Name: total_counts, dtype: float64

Computed n_genes per cell.
count      243.000000
mean     10330.358025
std       2260.259356
min         35.000000
25%      10117.000000
50%      10831.000000
75%      11409.000000
max      13986.000000
Name: n_genes, dtype: float64

Identified 36 mitochondrial genes.

Computed percent mitochondrial gene expression per cell.
count    243.000000
mean       2.197282
std        3.173782
min        0.000000
25%        1.462458
50%        1.840945
75%        2.468950
max       49.215686
Name: pct_mito, dtype: float64
In [41]:
fig, axs = plt.subplots(1, 3, figsize=(15, 4))

axs[0].hist(qc_ss_hcc['total_counts'], bins=30, color='gray')
axs[0].set_title("Total Counts per Sample")
axs[0].set_xlabel("Total Counts")

axs[1].hist(qc_ss_hcc['n_genes'], bins=30, color='steelblue')
axs[1].set_title("Number of Genes per Sample")
axs[1].set_xlabel("Genes Detected")

axs[2].hist(qc_ss_hcc['percent_zeros'], bins=30, color='darkred')
axs[2].set_title("% Zeros per Sample")
axs[2].set_xlabel("Percent Zeros")

plt.tight_layout()
plt.show()
No description has been provided for this image

These histograms show wide variation across cells, including some very low-count or high-zero samples that should be filtered out.

In [42]:
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

sns.violinplot(y=qc_ss_hcc['total_counts'], ax=axes[0])
axes[0].set_title("Total Counts per Cell")

sns.violinplot(y=qc_ss_hcc['n_genes'], ax=axes[1])
axes[1].set_title("Number of Genes per Cell")

sns.violinplot(y=qc_ss_hcc['pct_mito'], ax=axes[2])
axes[2].set_title("Mitochondrial Gene %")

plt.tight_layout()
plt.show()
No description has been provided for this image

Violin plots reveal that most cells fall within reasonable ranges, but tails indicate a few outliers.

In [43]:
adata = ad.AnnData(X=ss_hcc_raw.T)

adata.obs['total_counts'] = qc_ss_hcc['total_counts']
adata.obs['n_genes_by_counts'] = qc_ss_hcc['n_genes']
adata.obs['pct_counts_mt'] = qc_ss_hcc['pct_mito']

sc.pl.scatter(
    adata,
    x="total_counts",
    y="n_genes_by_counts",
    color="pct_counts_mt"
)
No description has been provided for this image

We apply thresholds (>2,000 genes, >100,000 reads, <10% mito) to keep only high-quality cells.

In [44]:
min_genes = 2_000  # Cells with very low gene counts (< 1000) should be filtered out
min_counts = 100_000  # Cells with extremely low counts may be low-quality
max_mito = 10  # A common threshold is 5%-10% to flag high-mito cells

high_quality_cells = qc_ss_hcc[
    (qc_ss_hcc['n_genes'] > min_genes) &
    (qc_ss_hcc['total_counts'] > min_counts) &
    (qc_ss_hcc['pct_mito'] < max_mito)
]

# Retain only the high-quality columns (cells)
ss_hcc_raw_filt = ss_hcc_raw[high_quality_cells.index]

print(f"Original: {ss_hcc_raw.shape[1]} cells")
print(f"Filtered: {ss_hcc_raw_filt.shape[1]} cells")
Original: 243 cells
Filtered: 233 cells

We filtered out 10 cells and now we inspect the plots again.

In [45]:
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

sns.violinplot(y=high_quality_cells['total_counts'], ax=axes[0])
axes[0].set_title("Total Counts per Cell")

sns.violinplot(y=high_quality_cells['n_genes'], ax=axes[1])
axes[1].set_title("Number of Genes per Cell")

sns.violinplot(y=high_quality_cells['pct_mito'], ax=axes[2])
axes[2].set_title("Mitochondrial Gene %")

plt.tight_layout()
plt.show()
No description has been provided for this image

After filtering, distributions tighten and outliers are removed, leaving a more homogeneous set of cells.

In [46]:
adata = ad.AnnData(X=ss_hcc_raw_filt.T)

adata.obs['total_counts'] = high_quality_cells['total_counts']
adata.obs['n_genes_by_counts'] = high_quality_cells['n_genes']
adata.obs['pct_counts_mt'] = high_quality_cells['pct_mito']

sc.pl.violin(
    adata,
    ["total_counts", "n_genes_by_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True
)
No description has been provided for this image
In [47]:
sc.pl.scatter(
    adata,
    x="total_counts",
    y="n_genes_by_counts",
    color="pct_counts_mt"
)
No description has been provided for this image

The violin and QC scatter plots confirm that filtering has been successful.

Duplicates¶

First we check for genes with all-zero expression profiles:

In [48]:
# Check original number of genes (rows)
original_rows = ss_hcc_raw_filt.shape[0]

# Drop genes with all-zero expression
ss_hcc_raw_filt = ss_hcc_raw_filt.loc[~(ss_hcc_raw_filt == 0).all(axis=1)]

# Check number of rows after dropping
remaining_rows = ss_hcc_raw_filt.shape[0]

# Compute how many were dropped
dropped_rows = original_rows - remaining_rows
print(f"Number of all-zero rows dropped: {dropped_rows}")
Number of all-zero rows dropped: 0

Smart-seq HCC1806 has no all-zero rows.

In this step, we scan the filtered HCC1806 matrix for any genes that have identical expression profiles across all cells — likely redundant entries from upstream processing — and then remove these duplicates. By reporting the row count before and after, we ensure our feature set contains only unique gene measurements.

In [49]:
duplicate_rows = ss_hcc_raw_filt[ss_hcc_raw_filt.duplicated(keep=False)]
print("number of duplicate rows: ", duplicate_rows.shape[0])
print("Rows before:", ss_hcc_raw_filt.shape[0])
ss_hcc_raw_filt = ss_hcc_raw_filt.drop_duplicates()
print("Rows after :", ss_hcc_raw_filt.shape[0])
number of duplicate rows:  92
Rows before: 23396
Rows after : 23339

Skeweness & Kurtosis¶

In [50]:
# DATA TRANSFORMATION
ss_hcc_raw_filt_log = np.log2(ss_hcc_raw_filt + 1)  # genes × cells
ss_hcc_raw_filt_T = ss_hcc_raw_filt.T
ss_hcc_raw_filt_T_log = np.log2(ss_hcc_raw_filt_T + 1)  # cells × genes
In [51]:
print("Before data transformation:", skew(ss_hcc_raw_filt.T.values.flatten()), kurtosis(ss_hcc_raw_filt.T.values.flatten()))
print("After data transformation:", skew(ss_hcc_raw_filt_T_log.values.flatten()), kurtosis(ss_hcc_raw_filt_T_log.values.flatten()))
Before data transformation: 66.87855806046676 10509.765195762657
After data transformation: 0.8280895490120331 -0.7178446753891188

After applying a log2 transformation to the expression matrix, the overall skewness (0.83) and kurtosis (−0.72) indicate that the distribution of gene expression values per cell is now approximately symmetric and less heavy-tailed. This transformation improves the suitability of the data for downstream analyses such as PCA and clustering.

Train a linear classifier in PCA space¶

Here, we repeat the PCA plus logistic regression work for the HCC1806 dataset. First, we project our filtered, log2-transformed gene expressions into their top principal components — visualizing in 2D and 3D to see whether hypoxic and normoxic samples separate naturally. Then we fit a simple logistic-regression model on the 3D coordinates to define a linear decision plane that predicts each cell’s condition, and finally report its classification accuracy.

In [52]:
# 1. Transpose the DataFrame so that rows = samples, columns = genes
ss_hcc_raw_filt_T = ss_hcc_raw_filt.T  # transpose necessary for pca.fit_transform()

# 2. Log-transform if data isn't already normalized
ss_hcc_raw_filt_T_log = np.log2(ss_hcc_raw_filt_T + 1)
In [53]:
# 3. Generate colors based on column/sample names
colors = ['royalblue' if 'hypo' in name.lower() else 'seagreen' for name in ss_hcc_raw_filt_T_log.index]

# 4. Run PCA with 2 components
pca = PCA(n_components=2)
pc = pca.fit_transform(ss_hcc_raw_filt_T_log)

# 5. Plot
plt.figure(figsize=(8,6))
plt.scatter(pc[:,0], pc[:,1], c=colors)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.title("PCA of Samples (Colored by Condition)")
plt.grid(True)
plt.tight_layout()
plt.show()
No description has been provided for this image

The 2D PCA shows how much of the hypoxia vs. normoxia signal is captured by PC1 and PC2.

In [54]:
# Run PCA with 3 components
pca = PCA(n_components=3)
pc = pca.fit_transform(ss_hcc_raw_filt_T_log)
In [55]:
pio.renderers.default = 'browser'

# Extract labels from sample names
labels = ['Hypo' if 'hypo' in name.lower() else 'Norm' for name in ss_hcc_raw_filt_T_log.index]

# Get indices for each condition
hypo_idx = np.array(labels) == 'Hypo'
norm_idx = np.array(labels) == 'Norm'

fig = go.Figure()

# Hypo samples
fig.add_trace(go.Scatter3d(
    x=pc[hypo_idx, 0],
    y=pc[hypo_idx, 1],
    z=pc[hypo_idx, 2],
    mode='markers',
    name='Hypo',
    marker=dict(color='royalblue', size=6),
    text=ss_hcc_raw_filt_T_log.index[hypo_idx],
    hoverinfo='text'
))

# Norm samples
fig.add_trace(go.Scatter3d(
    x=pc[norm_idx, 0],
    y=pc[norm_idx, 1],
    z=pc[norm_idx, 2],
    mode='markers',
    name='Norm',
    marker=dict(color='seagreen', size=6),
    text=ss_hcc_raw_filt_T_log.index[norm_idx],
    hoverinfo='text'
))

# Set layout
fig.update_layout(
    scene=dict(
        xaxis_title=f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)',
        yaxis_title=f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)',
        zaxis_title=f'PC3 ({pca.explained_variance_ratio_[2]*100:.1f}%)'
    ),
    title='3D PCA of Samples (Interactive)',
    margin=dict(l=0, r=0, b=0, t=40)
)

fig.show()

This is an interacive plot opening in a browser. In 3D space, the two conditions form more distinct clusters, suggesting good separability. Next, we'll train a logistic-regression classifier on the three PCA axes to find the best separating plane.

In [56]:
labels = [1 if 'hypo' in name.lower() else 0 for name in ss_hcc_raw_filt_T_log.index]

clf = LogisticRegression()
clf.fit(pc, labels)

# Extract coefficients (normal vector to the plane)
w = clf.coef_[0]  # [w1, w2, w3]
b = clf.intercept_[0]

# Create grid to cover the PCA space
x_range = np.linspace(pc[:, 0].min(), pc[:, 0].max(), 10)
y_range = np.linspace(pc[:, 1].min(), pc[:, 1].max(), 10)
xx, yy = np.meshgrid(x_range, y_range)

# Compute corresponding z for the plane
zz = (-w[0] * xx - w[1] * yy - b) / w[2]

# Compute decision values for all points
decision_values = np.dot(pc, w) + b

# Predicted labels: 1 if value > 0 (Hypo), else 0 (Norm)
predicted_labels = (decision_values > 0).astype(int)

The weights w and intercept b define our decision boundary; samples with positive decision values are classified as hypoxic.

In [57]:
labels = np.array([1 if 'hypo' in name.lower() else 0 for name in ss_hcc_raw_filt_T_log.index])

accuracy = (predicted_labels == labels).mean()
print(f"Accuracy of plane: {accuracy * 100:.2f}%")
Accuracy of plane: 90.13%

The classifier achieves ~90.13% accuracy, indicating that first 3 PCA components retain sufficient information to distinguish hypoxia from normoxia. This accuracy, however, is lower than for MCF7, suggesting that the Smart-seq HCC1806 data exhibits more subtle expression differences between conditions, making separation more challenging for linear models.

Differences between "...unfiltered...txt", "...filtered...txt", and "..normalized...txt" data¶

Data Overview¶

In [58]:
ss_mcf7_filt = pd.read_csv("AILab2025/SmartSeq/MCF7_SmartS_Filtered_Data.txt",delimiter=" ",engine='python',index_col=0)
ss_mcf7_norm = pd.read_csv("AILab2025/SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_train.txt",delimiter=" ",engine='python',index_col=0)

ss_hcc_filt = pd.read_csv("AILab2025/SmartSeq/HCC1806_SmartS_Filtered_Data.txt",delimiter=" ",engine='python',index_col=0)
ss_hcc_norm = pd.read_csv("AILab2025/SmartSeq/HCC1806_SmartS_Filtered_Normalised_3000_Data_train.txt",delimiter=" ",engine='python',index_col=0)

Here we load the filtered data (genes × cells) and the normalized training sets (top 3,000 genes × cells) for both cell lines.

In [59]:
ss_mcf7_filt.describe()
Out[59]:
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam output.STAR.1_B11_Hypo_S77_Aligned.sortedByCoord.out.bam output.STAR.1_B12_Hypo_S78_Aligned.sortedByCoord.out.bam output.STAR.1_B4_Norm_S52_Aligned.sortedByCoord.out.bam ... output.STAR.4_H10_Hypo_S382_Aligned.sortedByCoord.out.bam output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam
count 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 ... 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000 18945.000000
mean 49.409290 104.620639 17.589707 68.045395 91.260333 75.979784 81.576194 85.303985 49.655529 16.382792 ... 54.711375 21.016785 50.920137 39.622486 26.620164 21.099023 59.585537 74.487305 82.655054 76.081499
std 511.986757 1139.662971 136.014975 553.362211 472.099720 571.441098 504.632248 911.153373 406.561440 160.981562 ... 633.970615 212.338278 281.722199 329.984580 168.460343 217.871697 394.584632 594.260858 699.898130 863.857880
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 1.000000 1.000000 4.000000 6.000000 4.000000 1.000000 3.000000 0.000000 ... 0.000000 1.000000 4.000000 1.000000 0.000000 1.000000 0.000000 0.000000 1.000000 1.000000
75% 28.000000 27.000000 10.000000 36.000000 57.000000 52.000000 55.000000 48.000000 33.000000 9.000000 ... 28.000000 13.000000 42.000000 25.000000 17.000000 13.000000 42.000000 45.000000 55.000000 48.000000
max 46744.000000 82047.000000 10582.000000 46856.000000 29534.000000 50972.000000 36236.000000 56068.000000 24994.000000 13587.000000 ... 49147.000000 17800.000000 23355.000000 29540.000000 12149.000000 19285.000000 28021.000000 40708.000000 46261.000000 68790.000000

8 rows × 313 columns

In [60]:
ss_hcc_filt.describe()
Out[60]:
output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate1A9_Normoxia_S20_Aligned.sortedByCoord.out.bam ... output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam
count 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 ... 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000 19503.000000
mean 119.427883 249.107522 180.739527 42.818233 56.444393 183.264677 162.976670 46.014305 54.589961 96.803210 ... 91.583090 126.622930 64.801005 35.702302 34.670461 125.629544 42.195457 129.769010 44.715941 91.517561
std 577.934133 1069.768525 1067.470509 224.823960 596.882811 944.432350 951.367277 289.708746 401.024242 487.943421 ... 377.847391 585.760835 375.921207 203.991666 147.706909 484.448028 186.359651 643.033801 197.842998 402.529704
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 3.000000 7.000000 0.000000 1.000000 0.000000 0.000000 4.000000 0.000000 0.000000 9.000000 ... 9.000000 4.000000 1.000000 2.000000 1.000000 15.000000 3.000000 4.000000 2.000000 8.000000
75% 75.000000 179.000000 111.000000 31.000000 29.000000 126.000000 106.000000 32.000000 29.000000 78.000000 ... 77.000000 94.000000 42.000000 25.000000 27.000000 105.000000 34.000000 94.000000 32.000000 63.000000
max 35477.000000 69068.000000 70206.000000 17326.000000 47442.000000 43081.000000 62813.000000 30240.000000 35450.000000 42310.000000 ... 19629.000000 30987.000000 21894.000000 13457.000000 11488.000000 33462.000000 15403.000000 34478.000000 10921.000000 28532.000000

8 rows × 227 columns

In [61]:
ss_mcf7_norm.describe()
Out[61]:
output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam output.STAR.2_B4_Norm_S58_Aligned.sortedByCoord.out.bam output.STAR.2_B5_Norm_S59_Aligned.sortedByCoord.out.bam output.STAR.2_B6_Norm_S60_Aligned.sortedByCoord.out.bam output.STAR.2_B7_Hypo_S79_Aligned.sortedByCoord.out.bam output.STAR.2_B9_Hypo_S81_Aligned.sortedByCoord.out.bam output.STAR.2_C10_Hypo_S130_Aligned.sortedByCoord.out.bam output.STAR.2_C11_Hypo_S131_Aligned.sortedByCoord.out.bam output.STAR.2_C1_Norm_S103_Aligned.sortedByCoord.out.bam output.STAR.2_C2_Norm_S104_Aligned.sortedByCoord.out.bam ... output.STAR.4_H10_Hypo_S382_Aligned.sortedByCoord.out.bam output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam
count 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 ... 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000
mean 74.140333 90.907000 99.089000 88.137000 110.395667 148.849000 126.422667 142.229667 91.781000 91.426333 ... 144.008333 133.846000 98.699333 84.070333 101.416333 96.636667 92.344333 154.387333 125.340000 132.017667
std 345.005307 409.560228 442.980702 425.804372 822.178446 1710.088769 1351.567001 1515.496440 388.660906 376.793214 ... 1349.125183 1242.320764 417.410827 406.100983 513.988262 499.224863 680.698856 1169.686762 1066.926126 1422.143351
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 24.000000 37.000000 33.000000 34.000000 38.250000 24.000000 13.000000 22.000000 37.000000 44.000000 ... 33.000000 38.000000 52.250000 25.000000 33.000000 44.000000 17.000000 19.000000 21.000000 20.250000
max 8222.000000 10167.000000 11446.000000 10312.000000 30586.000000 65037.000000 52680.000000 60789.000000 9394.000000 9077.000000 ... 56392.000000 50404.000000 11352.000000 8713.000000 17006.000000 16625.000000 29663.000000 34565.000000 34175.000000 57814.000000

8 rows × 250 columns

In [62]:
ss_mcf7_norm.head(5)
Out[62]:
output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam output.STAR.2_B4_Norm_S58_Aligned.sortedByCoord.out.bam output.STAR.2_B5_Norm_S59_Aligned.sortedByCoord.out.bam output.STAR.2_B6_Norm_S60_Aligned.sortedByCoord.out.bam output.STAR.2_B7_Hypo_S79_Aligned.sortedByCoord.out.bam output.STAR.2_B9_Hypo_S81_Aligned.sortedByCoord.out.bam output.STAR.2_C10_Hypo_S130_Aligned.sortedByCoord.out.bam output.STAR.2_C11_Hypo_S131_Aligned.sortedByCoord.out.bam output.STAR.2_C1_Norm_S103_Aligned.sortedByCoord.out.bam output.STAR.2_C2_Norm_S104_Aligned.sortedByCoord.out.bam ... output.STAR.4_H10_Hypo_S382_Aligned.sortedByCoord.out.bam output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam
CYP1B1 343 131 452 27 5817 3841 9263 21543 1013 53 ... 7890 4512 160 351 327 196 504 34565 20024 5953
CYP1B1-AS1 140 59 203 7 2669 1565 3866 9113 459 22 ... 3647 2035 75 138 130 102 238 13717 7835 2367
CYP1A1 0 0 0 0 0 79 238 443 0 0 ... 86 1654 0 0 0 1 0 11274 563 522
NDRG1 0 1 0 0 654 1263 2634 540 0 13 ... 481 1052 0 0 54 243 62 1263 925 1572
DDIT4 386 289 0 288 2484 2596 1323 2044 36 204 ... 3692 2410 800 1 189 266 417 4256 12733 2275

5 rows × 250 columns

Looking at mean values, we immediately see that the normalized data has not been log-transformed. We will explore later whether further scaling or transformations are needed before training ML models on this data.

In [63]:
print("MCF7 raw:", ss_mcf7_raw.shape)
print("Filtered shape:", ss_mcf7_filt.shape)
print("Normalised shape:", ss_mcf7_norm.shape)

print("\nHCC1806 raw:", ss_hcc_raw.shape)
print("Filtered shape:", ss_hcc_filt.shape)
print("Normalised shape:", ss_hcc_norm.shape)
MCF7 raw: (22934, 383)
Filtered shape: (18945, 313)
Normalised shape: (3000, 250)

HCC1806 raw: (23396, 243)
Filtered shape: (19503, 227)
Normalised shape: (3000, 182)

Comparing shapes: raw data have all genes and cells, filtered data remove low-quality features, and normalized data keep only 3,000 genes.

SmartSeq MCF7 Raw vs Filtered¶

In this section, we compare our own simple filtering rules to the provided Smart-Seq filtered dataset for MCF7. We’ll look at two aspects:

  • Gene Filtering – identify how many genes our “expressed in >5 cells” rule removes versus the official filter, inspect the dropped genes’ expression and dispersion statistics, and check why they might have been excluded
  • Cell Filtering – count how many cells we’d remove based on simple QC thresholds (total counts >250 k & genes >5 k), compare that to the provided filtered set, and see where any discrepancies lie

Gene Filtering¶

In [64]:
ss_mcf7_raw_filt.shape
Out[64]:
(22843, 337)
In [65]:
genes_raw = set(ss_mcf7_raw.index)
genes_filtered = set(ss_mcf7_filt.index)

dropped_genes = genes_raw - genes_filtered
print(f"Genes dropped: {len(dropped_genes)}")
Genes dropped: 3989

Our filter removed significantly less genes (22934-22843=91) than the version provided to us (22934-18945=3989), so there must be additional criteria in the pipeline.

In [66]:
# Check expression stats for dropped genes
dropped_stats = ss_mcf7_raw.loc[list(dropped_genes)].sum(axis=1).describe()
print(dropped_stats)
count    3989.000000
mean       19.475307
std        43.194078
min         2.000000
25%         4.000000
50%         8.000000
75%        20.000000
max      1530.000000
dtype: float64

The dropped genes still show moderate total counts, so they aren’t all low-expressed “noise". Let's see what other filters could have been aplied.

In [67]:
# Genes that are expressed in more than 5 cells
ss_mcf7_raw_genes_mask = (ss_mcf7_raw > 0).sum(axis=1) > 5  # higher threshold results in less than 18945 genes remaining
ss_mcf7_raw_gene_set = set(ss_mcf7_raw.index[ss_mcf7_raw_genes_mask])

print(f"Genes passing our threshold: {len(ss_mcf7_raw_gene_set)}")
Genes passing our threshold: 19182

Only genes that are expressed in more than 5 cells were retained in ss_mcf7_filt:

In [68]:
ss_mcf7_raw_gene_set = ss_mcf7_raw.index[ss_mcf7_raw_genes_mask]

ss_mcf7_filt_gene_set = ss_mcf7_filt.index

overlap = ss_mcf7_raw_gene_set.intersection(ss_mcf7_filt_gene_set)
print(f"Overlap: {len(overlap)} / {len(ss_mcf7_filt_gene_set)}")
Overlap: 18945 / 18945

In our 19182 genes there are the filtered 18945 genes, but we still need to investigate why the remaining 237 genes were discarded.

In [69]:
# Convert Indexes to sets
ss_mcf7_raw_gene_set = set(ss_mcf7_raw_gene_set)
ss_mcf7_filt_gene_set = set(ss_mcf7_filt_gene_set)

# Find extra genes (present in the threshold but not in the filtered set)
extra_genes = ss_mcf7_raw_gene_set - ss_mcf7_filt_gene_set
extra_genes_list = list(extra_genes)
ss_mcf7_raw.loc[extra_genes_list].mean(axis=1).describe()
Out[69]:
count    237.000000
mean       0.125183
std        0.160211
min        0.015666
25%        0.033943
50%        0.078329
75%        0.156658
max        1.360313
dtype: float64
In [70]:
# Expression counts for the remaining 237 genes
ss_mcf7_raw.loc[extra_genes_list].sum(axis=1).describe()
Out[70]:
count    237.000000
mean      47.945148
std       61.360780
min        6.000000
25%       13.000000
50%       30.000000
75%       60.000000
max      521.000000
dtype: float64
In [71]:
subset = ss_mcf7_raw.loc[extra_genes_list]
gene_means = subset.mean(axis=1)
gene_vars = subset.var(axis=1)
gene_dispersion = gene_vars / gene_means.replace(0, np.nan)

print("\nVariance:")
print(gene_vars.describe())

print("\nDispersion (var / mean):")
print(gene_dispersion.describe())
Variance:
count    237.000000
mean       6.859440
std       35.000200
min        0.015461
25%        0.105792
50%        0.637089
75%        2.725698
max      432.822714
dtype: float64

Dispersion (var / mean):
count    237.000000
mean      17.180863
std       31.133525
min        0.984293
25%        2.979058
50%        8.507272
75%       18.379411
max      318.178694
dtype: float64

These genes are not low-variance “noise” genes:

  • Dispersion values are well above common filtering thresholds (e.g., 0.5–1.0)
  • Variance spans a wide range, including quite high (up to 432)

Next: Are they being dropped based on their identity (e.g. pseudogenes, mitochondrial, ribosomal)?

In [72]:
extra_genes_series = pd.Series(extra_genes_list)

# Check for known non-informative categories
is_mito = extra_genes_series.str.startswith("MT-")
is_ribo = extra_genes_series.str.startswith("RPL") | extra_genes_series.str.startswith("RPS")
is_pseudo = extra_genes_series.str.contains("-P")
is_mirna = extra_genes_series.str.contains("MIR")
is_linc = extra_genes_series.str.contains("LINC")

print(f"Mitochondrial: {is_mito.sum()}")
print(f"Ribosomal: {is_ribo.sum()}")
print(f"Pseudogenes (-P): {is_pseudo.sum()}")
print(f"miRNAs (MIR): {is_mirna.sum()}")
print(f"LINC: {is_linc.sum()}")
Mitochondrial: 0
Ribosomal: 13
Pseudogenes (-P): 0
miRNAs (MIR): 8
LINC: 13

Only ~14% fall into obvious categories (MT, RPL/RPS, etc.), so most dropped genes remain unexplained by standard filters.

In [73]:
print(extra_genes_list)
['SLC9C2', 'FEZF1-AS1', 'CALD1', 'KLRG1', 'ASAP1-IT2', 'GOLGA8R', 'LINC02137', 'MIR3143', 'DDR1-DT', 'LINC00661', 'PARP4P2', 'KCNC1', 'LINC00964', 'RPL12P33', 'CYP2F1', 'SLC34A3', 'TRAV27', 'SPOCK2', 'HSPB6', 'SNAP91', 'AMDHD1', 'FST', 'HMGN2P23', 'OR8B7P', 'HNRNPA1P35', 'MRRFP1', 'FNDC8', 'LINC02035', 'TNNC1', 'ARL9', 'ARHGEF18-AS1', 'SNTG1', 'XAGE2', 'SH3TC1', 'TAS2R43', 'TRPM3', 'LRRC43', 'P3H3', 'TLR9', 'IQGAP2', 'PRB2', 'STAB2', 'MAPK11', 'RPL7P36', 'ADM-DT', 'SOHLH2', 'QRICH2', 'C17orf50', 'TUSC8', 'LRIG3-DT', 'CELF3', 'RP1L1', 'IPCEF1', 'A2ML1', 'MSX2P1', 'HMGA1P3', 'PRSS30P', 'CD70', 'LINC00310', 'ARAP3', 'DDX10P2', 'MYO1A', 'AJAP1', 'RNA5SP18', 'MIR7161', 'SNORD3B-2', 'NDUFA3P1', 'CUX2', 'PKD1P2', 'HSPE1P6', 'HNRNPA3P9', 'RPL5P5', 'HMX3', 'CPB2-AS1', 'MXRA8', 'CELF4', 'MCTP2', 'RNA5SP392', 'PDPN', 'RPL23AP73', 'HSPE1P7', 'PPIAP51', 'SLC7A9', 'SNORA68B', 'CA5A', 'DNAJC8P1', 'PPARGC1A', 'OR2A7', 'LINC02169', 'TYRO3P', 'RPL32P2', 'HLA-S', 'RNA5SP440', 'HTR2B', 'AMPH', 'RALBP1P1', 'MIR4449', 'PSAT1', 'RNA5SP477', 'TRMT1P1', 'CCBE1', 'TAL2', 'UBE2CP2', 'LRAT', 'KLF1', 'OR7E7P', 'OR8B5P', 'MAMDC2-AS1', 'WIPF3', 'ABHD12B', 'GPR150', 'CHRNG', 'SLC7A2-IT1', 'MIR3188', 'ITGA1', 'MIR503', 'CCN3', 'LINC02895', 'LINC00572', 'RN7SL688P', 'HHIP', 'RPL23AP10', 'IGLV1-51', 'VAV1', 'SYT2', 'MIR2861', 'ADIPOQ', 'KRT37', 'CACNA1A', 'RNVU1-21', 'BACH1-AS1', 'KIF26B', 'RNU1-1', 'AQP8', 'RPS7P11', 'HOXB-AS1', 'C3orf35', 'ALOX15P1', 'NUTM2F', 'PLAT', 'IGF2-AS', 'NAPSB', 'TEX38', 'RPL7P52', 'LRRC15', 'ARG1', 'TSPEAR-AS2', 'FER1L6', 'PTMAP15', 'SEC14L5', 'TMEM47', 'LINC01224', 'WWC2-AS2', 'RRAD', 'DHX58', 'RPL21P34', 'NR1I2', 'LINC01863', 'PNMA8B', 'KLHL6', 'LCN2', 'DNAH12', 'CHMP4BP1', 'PIWIL2', 'DDC', 'TCF7L1-IT1', 'COLGALT2', 'HLA-DQA2', 'NOVA2', 'CPLX3', 'PCDHB5', 'CABP1', 'NRXN3', 'RBBP4P4', 'TRIM72', 'VTRNA1-1', 'ENPP3', 'SNORD98', 'MIR4271', 'TIGD4', 'OR4F21', 'HMGB1P26', 'EYS', 'VWA2', 'ITIH4', 'BMS1P3', 'FAM71F2', 'LINC02009', 'GAPDHP39', 'CES3', 'ANKLE1', 'PADI1', 'MRPL23-AS1', 'PEG10', 'THSD4-AS1', 'SHD', 'FTH1P1', 'PTCHD4', 'RN7SL605P', 'BATF2', 'AOC1', 'ABCC13', 'TSPEAR', 'MIR3175', 'TRIM17', 'RBP5', 'ELMOD1', 'LINC02014', 'BANK1', 'RPL26P6', 'RPL8P1', 'TMEM88', 'LARP1P1', 'H3P42', 'RPL4P2', 'TIMP4', 'FER1L6-AS2', 'FAM135A', 'ITGB7', 'RPL21P131', 'EPSTI1', 'IGLL3P', 'DPP3P1', 'GZMM', 'RN7SKP36', 'VILL', 'LINC01456', 'SPATA46', 'ANGPT4', 'S100A7', 'A2M', 'SNCAIP', 'DAND5', 'NFYBP1', 'EPO', 'SERPINA4', 'GOSR2-DT']

Let us see whether the removed genes are duplicates.

In [74]:
duplicate_rows = ss_mcf7_raw[ss_mcf7_raw.duplicated(keep=False)]
print("number of duplicate rows: ", duplicate_rows.shape[0])

# Convert to sets
duplicate_gene_set = set(duplicate_rows.index)
extra_removed_set = set(extra_genes_list)

# Intersect to find which of the 237 removed genes were duplicates
dup_overlap = extra_removed_set.intersection(duplicate_gene_set)

print(f"Removed genes that were also duplicates: {len(dup_overlap)}")
print("Example overlapping genes:", list(dup_overlap)[:10])
number of duplicate rows:  56
Removed genes that were also duplicates: 0
Example overlapping genes: []

The removed genes are not duplicates. The remaining ~200 are likely filtered by manual curation or a custom blacklist.

Cell Filtering¶

In [75]:
dropped_cells = ss_mcf7_raw.shape[1] - ss_mcf7_filt.shape[1]
print(f"Number of removed cells: {dropped_cells}")
Number of removed cells: 70

Now we examine some properties of the dropped cells to identify why they were discarded.

In [76]:
qc_cells = pd.DataFrame({
    "total_counts": ss_mcf7_raw.sum(axis=0),
    "n_genes": (ss_mcf7_raw > 0).sum(axis=0)
})

retained_cells = ss_mcf7_filt.columns
dropped_cells = ss_mcf7_raw.columns.difference(retained_cells)

qc_retained = qc_cells.loc[retained_cells]
qc_dropped = qc_cells.loc[dropped_cells]

print("Retained cells:")
print(qc_retained.describe())

print("\nDropped cells:")
print(qc_dropped.describe())
Retained cells:
       total_counts       n_genes
count  3.130000e+02    313.000000
mean   1.158035e+06  10046.894569
std    3.964047e+05   1258.586404
min    2.633690e+05   5358.000000
25%    8.801830e+05   9326.000000
50%    1.199119e+06  10242.000000
75%    1.460597e+06  10922.000000
max    1.982470e+06  12260.000000

Dropped cells:
       total_counts       n_genes
count  7.000000e+01     70.000000
mean   2.638760e+05   4998.542857
std    5.509890e+05   3444.854204
min    1.000000e+00      1.000000
25%    6.060000e+03   2196.250000
50%    7.184050e+04   5204.000000
75%    1.470945e+05   8015.750000
max    2.308057e+06  12519.000000

Dropped cells have lower mean total counts (by one order of magnitude) and a lower mean number of genes (10,000 vs 5000)

In [77]:
# Apply a candidate filter
# We only keep the genes that have more than 250,000 total counts and more than 5000 genes
cell_mask = (qc_cells['total_counts'] > 250_000) & (qc_cells['n_genes'] > 5000)
filtered_candidate = set(qc_cells.index[cell_mask])

# Cells in the original filtered dataset
original_filtered = set(ss_mcf7_filt.columns)

# Overlap
overlap = filtered_candidate.intersection(original_filtered)
print(f"Candidate filter retains: {len(filtered_candidate)} cells")
print(f"Overlap with ss_mcf7_filt: {len(overlap)} / {len(original_filtered)} ({len(overlap)/len(original_filtered)*100:.1f}%)")
Candidate filter retains: 320 cells
Overlap with ss_mcf7_filt: 313 / 313 (100.0%)

Experimenting with multiple total counts and number of genes thresholds, we obtain that >250,000 and >5,000, respectively, are the best threshold-based filtering rules. Only 7 cells remain in our filtered dataset that are not present in ss_mcf7_filt and there is a perfect overlap for the other 313 cells. The remaining 7 cells were likely removed manually.

SmartSeq MCF7 Filtered vs Normalised + Filtered¶

In this section, we compare the filtered Smart-Seq matrix (all high-quality genes and cells) against the final normalized version that retains only 3,000 genes. We’ll walk through:

  • exploration: How many genes and cells are lost during normalization, and how do per-cell totals and detection rates change?
  • variance analysis: How does log-transform and normalization affect gene-wise variance and the choice of the top 3,000 genes?
  • normalization methods: Reconstruct the normalization steps (e.g. total-count scaling to 1e6 or median library size) to pinpoint which approach matches the provided data.
  • cell dropout: Investigate why 63 cells vanish post-normalization; do their QC metrics or expression sparsity explain the removal?
  • gene dropout: Examine the 15,945 genes dropped; does variance or dispersion alone predict their exclusion, or is a more sophisticated highly variable gene selection algorithm (e.g. Scanpy’s Seurat flavor) required?

Together, these analyses reveal exactly how filtering and normalization reshape our data, and why certain cells or genes are retained or discarded in the final training set.

Exploration¶

In [78]:
# Genes in filtered but not in normalised
dropped_genes = ss_mcf7_filt.index.difference(ss_mcf7_norm.index)
print(f"Genes dropped during normalization: {len(dropped_genes)}")
Genes dropped during normalization: 15945
In [79]:
dropped_cells = ss_mcf7_filt.columns.difference(ss_mcf7_norm.columns)
print(f"Cells dropped during normalization: {len(dropped_cells)}")
Cells dropped during normalization: 63

A substantial number of genes (15,945) and a smaller set of cells (63) are removed when we go from filtered to normalized. Let’s see how that impacts counts and detection rates.

In [80]:
# Total counts per cell = sum of all gene expression values
total_counts_before = ss_mcf7_filt.sum(axis=0)

# Number of expressed genes (non-zero) per cell
n_genes_before = (ss_mcf7_filt > 0).sum(axis=0)

total_counts_after = ss_mcf7_norm.sum(axis=0)
n_genes_after = (ss_mcf7_norm > 0).sum(axis=0)

print(f"Average total counts (before): {total_counts_before.mean():.2f}")
print(f"Average total counts (after):  {total_counts_after.mean():.2f}\n")

print(f"Average n_genes per cell (before): {n_genes_before.mean():.2f}")
print(f"Average n_genes per cell (after):  {n_genes_after.mean():.2f}")
Average total counts (before): 1157815.77
Average total counts (after):  347700.15

Average n_genes per cell (before): 10011.00
Average n_genes per cell (after):  1091.34

Total counts and number of genes were reduced by a factor of 10.

In [81]:
# Combine total counts
ss_mcf7_total_counts = pd.DataFrame({
    'total_counts': pd.concat([total_counts_before, total_counts_after]),
    'stage': ['Before'] * len(total_counts_before) + ['After'] * len(total_counts_after)
})

# Combine gene counts
ss_mcf7_n_genes = pd.DataFrame({
    'n_genes': pd.concat([n_genes_before, n_genes_after]),
    'stage': ['Before'] * len(n_genes_before) + ['After'] * len(n_genes_after)
})


plt.figure(figsize=(12, 5))

# Violin plot for total counts
plt.subplot(1, 2, 1)
sns.violinplot(data=ss_mcf7_total_counts, x='stage', y='total_counts', hue='stage', palette='Set2', legend=False)
plt.title("Total Counts per Cell")
plt.xlabel("")

# Violin plot for # of genes
plt.subplot(1, 2, 2)
sns.violinplot(data=ss_mcf7_n_genes, x='stage', y='n_genes', hue='stage', palette='Set2', legend=False)
plt.title("Number of Genes per Cell")
plt.xlabel("")

plt.suptitle("Before vs After Normalization", fontsize=14)
plt.tight_layout()
plt.show()
No description has been provided for this image

The “After” violins are noticeably narrower, indicating that cells have been rescaled to a common library size, thereby reducing variability in total counts and gene detection across cells.

In [82]:
# How much does mean (log-transformed) expression vary across cells
print("Filtered + log2 mean variance:", np.log2(ss_mcf7_filt + 1).var(axis=1).mean())
print("Normalised + log2 mean variance:", np.log2(ss_mcf7_norm + 1).var(axis=1).mean())
Filtered + log2 mean variance: 2.7114283938366706
Normalised + log2 mean variance: 3.9135395503089314

The normalised data has a higher mean variance per gene under log-transform. Let us examine this further:

In [83]:
ss_mcf7_filt_log = np.log2(ss_mcf7_filt + 1)
ss_mcf7_norm_log = np.log2(ss_mcf7_norm + 1)

filt_gene_var_log = ss_mcf7_filt_log.var(axis=1)
norm_gene_var_log = ss_mcf7_norm_log.var(axis=1)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Raw variance
sns.histplot(filt_gene_var_log, bins=100, color='red', stat='density', ax=axes[0])
axes[0].set_title("Log2 Gene Variance in Filtered Data")
axes[0].set_xlabel("Variance")

# Log-transformed variance
sns.histplot(norm_gene_var_log, bins=100, color='blue', stat='density', ax=axes[1])
axes[1].set_title("Log2 Gene Variance in Normalised Data")
axes[1].set_xlabel("Variance")

plt.tight_layout()
plt.show()
No description has been provided for this image

The left plot shows the distribution of gene variance (log2-transformed) in the filtered data. Most genes exhibit very low variance, with only a few showing substantial variability. After normalization (right plot), the distribution becomes broader, with more genes showing moderate-to-high variance (histogram shifts to the right). This indicates improved dynamic range and suggests that normalization helped preserve biologically variable genes while reducing the influence of low-variance or uninformative ones.

In [84]:
# Calculate variance per gene in filtered data
gene_var = ss_mcf7_filt.var(axis=1)

# Get top 3000 highly variable genes
top_var_genes = gene_var.sort_values(ascending=False).head(3000)

# Compare to normalized gene set
kept_genes = ss_mcf7_norm.index

# How many retained genes overlap with the top variable genes?
overlap = kept_genes.intersection(top_var_genes.index)
print(f"{len(overlap)} of {len(kept_genes)} genes in normalized data are among the top 3000 variable genes.")
894 of 3000 genes in normalized data are among the top 3000 variable genes.

It appears that a more advanced technique was used to pinpoint the top 3000 variable genes. We explore this further in section 'Why 15945 Genes Were Dropped?'.

What normalisation could have been applied?¶

Let's try rescaling the filtered data so that each cell has a total of 1M counts.

In [85]:
sc.pp.normalize_total(adata, target_sum=1e6)  # each cell ends up with a total of 1,000,000 counts

ss_mcf7_norm_like = ss_mcf7_filt.div(ss_mcf7_filt.sum(axis=0), axis=1) * 1e6  # manual reimplementation of Scanpy’s normalize_total
total_counts_norm_like = ss_mcf7_norm_like.sum(axis=0)

print("Total counts after rescaling to 1e6:", total_counts_norm_like.describe())  # we expect min = max = mean = 1e6

# Take the intersection of shared genes and cells (250, 3000)
common_cells = ss_mcf7_norm.columns.intersection(ss_mcf7_norm_like.columns)
common_genes = ss_mcf7_norm.index.intersection(ss_mcf7_norm_like.index)

diff = (ss_mcf7_norm.loc[common_genes, common_cells] - 
        ss_mcf7_norm_like.loc[common_genes, common_cells]).abs().mean().mean()  # mean across all genes and all cells

print(f"Mean abs difference to scaled-to-1e6 normalization: {diff:.2f}")
Total counts after rescaling to 1e6: count    3.130000e+02
mean     1.000000e+06
std      5.434847e-11
min      1.000000e+06
25%      1.000000e+06
50%      1.000000e+06
75%      1.000000e+06
max      1.000000e+06
dtype: float64
Mean abs difference to scaled-to-1e6 normalization: 19.31

Absolute differences are small (~19 across ~1M-range values), but they don't tell us much because expression values span a large range.

In [86]:
# Flatten both matrices and extract the common genes/cells
flat_original = ss_mcf7_norm.loc[common_genes, common_cells].values.flatten()
flat_recreated = ss_mcf7_norm_like.loc[common_genes, common_cells].values.flatten()

# Compute Pearson correlation
cor = np.corrcoef(flat_original, flat_recreated)[0,1]
print(f"Pearson correlation of expression values: {cor:.4f}")
Pearson correlation of expression values: 0.9999

There is an almost perfect linear relationship between the expression values in the two matrices, which makes sene as we are just scaling ss_mcf7_filt and so if a value is high in one matrix, it will be high in the other as well. Nevertheless, the relatively small absolute difference suggests that we are on the right track.

Next we look at Scanpy's suggested approach: normalising to median total counts.

In [87]:
# Scanpy Normalisation

# Transpose the matrix to match AnnData convention: cells as rows
X = ss_mcf7_filt.T  # shape: (cells × genes)

# Convert to AnnData
adata = ad.AnnData(X=X)

# Optional: name the genes and cells
adata.var_names = ss_mcf7_filt.index
adata.obs_names = ss_mcf7_filt.columns

# Normalise total counts per cell (default target_sum is median library size)
sc.pp.normalize_total(adata)

# Convert back to DataFrame
adata_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
In [88]:
# Transpose back so that both datasets are genes × cells
adata_df_T = adata_df.T

# Get common genes and cells
common_genes = ss_mcf7_norm.index.intersection(adata_df_T.index)
common_cells = ss_mcf7_norm.columns.intersection(adata_df_T.columns)

# Compute absolute difference
diff_matrix = (ss_mcf7_norm.loc[common_genes, common_cells] - 
               adata_df_T.loc[common_genes, common_cells]).abs()

# Mean absolute difference
mean_abs_diff = diff_matrix.mean().mean()
print(f"Mean absolute difference: {mean_abs_diff:.2f}")

# Pearson correlation between flattened matrices
flat_orig = ss_mcf7_norm.loc[common_genes, common_cells].values.flatten()
flat_scanpy = adata_df_T.loc[common_genes, common_cells].values.flatten()

cor = np.corrcoef(flat_orig, flat_scanpy)[0, 1]
print(f"Pearson correlation: {cor:.4f}")
Mean absolute difference: 0.77
Pearson correlation: 0.9999

Normalizing to each cell’s median library size produces an almost perfect match - confirming that Scanpy’s default normalize_total (with median target) is the likely pipeline.

Is normalising to median counts a good approach? Yes!

Cells vary in sequencing depth: Some cells may have more total counts just due to being more deeply sequenced, not because they express more genes biologically. Total count normalization controls for this technical variability by rescaling each cell to have the same total count, making gene expression values comparable across cells. Using the median total count instead of a fixed value (like 1e5 or 1e6) ensures the scaling is dataset-specific and robust to outliers.

Why 63 Cells Were Dropped?¶

In [89]:
# Quality Control for the dropped cells in the original space
dropped_cells = ss_mcf7_filt.columns.difference(ss_mcf7_norm.columns)
qc_metrics = pd.DataFrame({
    "total_counts": ss_mcf7_filt.sum(axis=0),
    "n_genes": (ss_mcf7_filt > 0).sum(axis=0)
})

qc_metrics.loc[dropped_cells].describe()
Out[89]:
total_counts n_genes
count 6.300000e+01 63.000000
mean 1.104397e+06 9660.587302
std 5.121923e+05 1192.260534
min 2.633430e+05 6433.000000
25% 7.101980e+05 8902.500000
50% 1.229842e+06 9791.000000
75% 1.515264e+06 10590.500000
max 1.982038e+06 11780.000000
In [90]:
# Quality Control for the retained cells in the original space
cells_retained = ss_mcf7_norm.columns
qc_metrics.loc[cells_retained].describe()
Out[90]:
total_counts n_genes
count 2.500000e+02 250.000000
mean 1.171277e+06 10099.304000
std 3.613786e+05 1253.815198
min 2.847180e+05 5322.000000
25% 9.383340e+05 9443.250000
50% 1.198204e+06 10303.000000
75% 1.448138e+06 10955.250000
max 1.970851e+06 12217.000000

QC metrics (total counts, genes detected) are nearly identical for dropped vs. retained cells, so their removal was probably tied to the gene-selection step rather than poor quality.

In [91]:
# Subset filtered matrix to just the 3000 genes in ss_mcf7_norm
genes_to_keep = ss_mcf7_norm.index
cells_to_check = ss_mcf7_filt.columns.difference(ss_mcf7_norm.columns)

subset = ss_mcf7_filt.loc[genes_to_keep, cells_to_check]

# Number of expressed genes (non-zero) per dropped cell
nonzeros_per_cell = (subset > 0).sum(axis=0)
nonzeros_per_cell.describe()
Out[91]:
count      63.000000
mean     1037.968254
std       149.883715
min       662.000000
25%       927.500000
50%      1070.000000
75%      1155.500000
max      1322.000000
dtype: float64
In [92]:
# Retained cells
cells_retained = ss_mcf7_norm.columns
subset_retained = ss_mcf7_filt.loc[genes_to_keep, cells_retained]

nonzeros_retained = (subset_retained > 0).sum(axis=0)
nonzeros_retained.describe()
Out[92]:
count     250.00000
mean     1082.24800
std       167.51017
min       574.00000
25%       982.25000
50%      1092.00000
75%      1197.75000
max      1455.00000
dtype: float64

Dropped cells still have thousands of detected genes among the top 3k set, so no obvious sparsity issue explains their removal.

In [93]:
# Function to extract just the condition (Hypo or Norm)
def extract_condition(colname):
    return colname.split('_')[2]  # "Hypo" or "Norm"

# Apply to all cells in filtered data
condition_all = ss_mcf7_filt.columns.to_series().apply(extract_condition)

# Apply to only the 250 retained cells
condition_retained = condition_all.loc[ss_mcf7_norm.columns]

# Count full and retained condition distributions
print("Condition distribution in full filtered set (313 cells):")
print(condition_all.value_counts())

print("\nCondition distribution in retained normalized set (250 cells):")
print(condition_retained.value_counts())
Condition distribution in full filtered set (313 cells):
Norm    158
Hypo    155
Name: count, dtype: int64

Condition distribution in retained normalized set (250 cells):
Norm    126
Hypo    124
Name: count, dtype: int64

The hypoxia vs. normoxia balance remains consistent, suggesting no condition-specific bias in cell dropout.

It is unclear to us why exactly those 63 cells were dropped.

Why 15945 Genes Were Dropped?¶

First we examine whether the genes could have been dropped due to low variance or low disperson:

In [94]:
genes_norm = ss_mcf7_norm.index

ss_mcf7_norm_like = ss_mcf7_filt.div(ss_mcf7_filt.sum(axis=0), axis=1) * 1e5

gene_stats = pd.DataFrame({
    'mean': ss_mcf7_norm_like.mean(axis=1),
    'variance': ss_mcf7_norm_like.var(axis=1)
})

gene_stats_sorted = gene_stats.sort_values(by='variance', ascending=False)

top_3000_var = gene_stats_sorted.head(3000)
genes_predicted = top_3000_var.index

# How many genes overlap?
overlap = genes_predicted.intersection(genes_norm)
print(f"Overlap with ss_mcf7_norm genes (variance): {len(overlap)} / 3000")

gene_stats['dispersion'] = gene_stats['variance'] / gene_stats['mean']
gene_stats_filtered = gene_stats[gene_stats['mean'] > 0]  # avoid div by zero
top_3000_disp = gene_stats_filtered.sort_values(by='dispersion', ascending=False).head(3000)

overlap_disp = top_3000_disp.index.intersection(genes_norm)
print(f"Overlap with ss_mcf7_norm genes (dispersion): {len(overlap_disp)} / 3000")
Overlap with ss_mcf7_norm genes (variance): 930 / 3000
Overlap with ss_mcf7_norm genes (dispersion): 1572 / 3000

High variance is not a good predictor for which genes were retained. Dispersion is better, but still not good enough, so we try Scanpy's HVG selection.

In [95]:
# Use our normalisation
ss_mcf7_norm_full = adata_df.T

# Convert to AnnData
adata = anndata.AnnData(X=np.log1p(ss_mcf7_norm_full.T.astype(float)))  # See Note below
adata.var_names = ss_mcf7_filt.index
adata.obs_names = ss_mcf7_filt.columns

# Run HVG selection on the full gene set
sc.pp.highly_variable_genes(
    adata,
    flavor='seurat',
    n_top_genes=3000,
    inplace=True
)

# Filter to top 3,000 genes
adata = adata[:, adata.var['highly_variable']]

# Get the 3000 HVG gene names just selected by Scanpy
hvgs_from_filt = adata.var_names

# Get the original 3000 gene names from the ss_mcf7_norm matrix
hvgs_from_norm = ss_mcf7_norm.index

# Compute overlap
overlap = hvgs_from_filt.intersection(hvgs_from_norm)
print(f"Overlap: {len(overlap)} / 3000")
print(f"Percenatge Overlap: {len(overlap)/ 3000 * 100}")
Overlap: 2346 / 3000
Percenatge Overlap: 78.2

Using Scanpy’s Seurat-flavored highly_variable_genes reproduces ~78% of the official gene set, confirming that HVG selection drove most gene exclusions. 78% overlap is very good because these 3000 genes make up only 15% of the genes in ss_mcf7_filt.

Note: The log1p transformation is applied to stabilize variance and reduce the influence of highly expressed genes. This makes the selection of highly variable genes (HVGs) more biologically meaningful and statistically robust. This approach follows best practices from the Scanpy tutorials.

SmartSeq HCC1806 Raw vs Filtered¶

Here we repeat the gene- and cell-filtering comparison for the HCC1806 line, using the same “expressed in >5 cells” rule and QC thresholds as before. We’ll quantify how many genes and cells our simple filters remove versus the provided ss_hcc_filt dataset, and inspect any discrepancies for signs of manual curation or additional criteria.

Gene Filtering¶

In [96]:
ss_hcc_raw_filt.shape
Out[96]:
(23339, 233)
In [97]:
ss_hcc_filt.shape
Out[97]:
(19503, 227)
In [98]:
genes_raw = set(ss_hcc_raw.index)
genes_filtered = set(ss_hcc_filt.index)

dropped_genes = genes_raw - genes_filtered
print(f"Genes dropped: {len(dropped_genes)}")
Genes dropped: 3893

We see how many total genes the official filter removed compared to the raw set. We apply the same expression threshold as for MCF7:

In [99]:
# Genes that are expressed in more than 5 cells
ss_hcc_raw_genes_mask = (ss_hcc_raw > 0).sum(axis=1) > 5  # higher threshold results in less than 19503 genes remaining
ss_hcc_raw_gene_set = set(ss_hcc_raw.index[ss_hcc_raw_genes_mask])

print(f"Genes passing our threshold: {len(ss_hcc_raw_gene_set)}")
Genes passing our threshold: 19681

Our threshold drops far fewer genes than the official filter, indicating extra curation steps in the pipeline.

In [100]:
ss_hcc_raw_gene_set = ss_hcc_raw.index[ss_hcc_raw_genes_mask]

ss_hcc_filt_gene_set = ss_hcc_filt.index

overlap = ss_hcc_raw_gene_set.intersection(ss_hcc_filt_gene_set)
print(f"Overlap: {len(overlap)} / {len(ss_hcc_filt_gene_set)}")
Overlap: 19503 / 19503
In [101]:
# Convert Indexes to sets
ss_hcc_raw_gene_set = set(ss_hcc_raw_gene_set)
ss_hcc_filt_gene_set = set(ss_hcc_filt_gene_set)

# Find extra genes (present in the threshold but not in the filtered set)
extra_genes = ss_hcc_raw_gene_set - ss_hcc_filt_gene_set
extra_genes_list = list(extra_genes)
ss_hcc_raw.loc[extra_genes_list].mean(axis=1).describe()
Out[101]:
count    178.000000
mean       0.321589
std        0.438802
min        0.024691
25%        0.058642
50%        0.164609
75%        0.421811
max        3.720165
dtype: float64

Similarly to the MCF7 case, we have 178 genes filtered out of ss_hcc_raw that pass the above threshold and were most likely removed by manual curation.

Cell Filtering¶

In [102]:
dropped_cells = ss_hcc_raw.shape[1] - ss_hcc_filt.shape[1]
print(f"Number of removed cells: {dropped_cells}")
Number of removed cells: 16
In [103]:
qc_cells = pd.DataFrame({
    "total_counts": ss_hcc_raw.sum(axis=0),
    "n_genes": (ss_hcc_raw > 0).sum(axis=0)
})

retained_cells = ss_hcc_filt.columns
dropped_cells = ss_hcc_raw.columns.difference(retained_cells)

qc_retained = qc_cells.loc[retained_cells]
qc_dropped = qc_cells.loc[dropped_cells]

print("Retained cells:")
print(qc_retained.describe())

print("\nDropped cells:")
print(qc_dropped.describe())
Retained cells:
       total_counts       n_genes
count  2.270000e+02    227.000000
mean   2.095821e+06  10735.555066
std    1.084443e+06   1025.490256
min    3.421620e+05   7361.000000
25%    1.028269e+06  10260.500000
50%    2.157315e+06  10881.000000
75%    2.965222e+06  11431.500000
max    4.858841e+06  12698.000000

Dropped cells:
       total_counts       n_genes
count  1.600000e+01     16.000000
mean   8.274348e+05   4581.625000
std    1.683270e+06   5370.398208
min    1.140000e+02     35.000000
25%    4.207500e+02     84.000000
50%    3.734050e+04    884.000000
75%    2.899445e+05   9234.500000
max    5.758132e+06  13986.000000

Similar to the MCF7 case, total counts and detected genes have lower means among the dropped cells compared to the retained cells. For this reason, we filter out the cells with low total counts and low number of genes.

In [104]:
# Apply a candidate filter
cell_mask = (qc_cells['total_counts'] > 250_000) & (qc_cells['n_genes'] > 4000)
filtered_candidate = set(qc_cells.index[cell_mask])

# Cells in the original filtered dataset
original_filtered = set(ss_hcc_filt.columns)

# Overlap
overlap = filtered_candidate.intersection(original_filtered)
print(f"Candidate filter retains: {len(filtered_candidate)} cells")
print(f"Overlap with ss_hcc_filt: {len(overlap)} / {len(original_filtered)} ({len(overlap)/len(original_filtered)*100:.1f}%)")
Candidate filter retains: 230 cells
Overlap with ss_hcc_filt: 227 / 227 (100.0%)

Experimenting with multiple total counts and number of genes thresholds, we obtain that >250,000 and >4,000, respectively, are the best threshold-based filtering rules. Only 3 cells remain in our filtered dataset that are not present in ss_mcf7_filt and there is a perfect overlap for the other 227 cells. The remaining 3 cells were likely removed manually.

SmartSeq HCC1806 Filtered vs Normalised + Filtered¶

In this section, we examine how normalization and the final 3,000-gene selection reshape the filtered HCC1806 dataset. We explore:

  • exploration: how many genes and cells are lost during normalization, and how do per-cell totals and gene counts change?
  • normalization method: reconstruct Scanpy’s median-library-size normalization to confirm it matches the provided normalized matrix
  • cell retention: compare QC metrics of dropped vs. retained cells to understand the basis of cell removal
  • gene retention: use Scanpy’s highly variable gene (HVG) selection to see if it explains which 3,000 genes remain

Exploration¶

In [105]:
# Genes in filtered but not in normalised
dropped_genes = ss_hcc_filt.index.difference(ss_hcc_norm.index)
print(f"Genes dropped during normalization: {len(dropped_genes)}")
Genes dropped during normalization: 16503
In [106]:
dropped_cells = ss_hcc_filt.columns.difference(ss_hcc_norm.columns)
print(f"Cells dropped during normalization: {len(dropped_cells)}")
Cells dropped during normalization: 45

A large number of genes (~16 503) and a moderate number of cells (45) are removed in the transition to normalized data.

In [107]:
# Total counts per cell = sum of all gene expression values
total_counts_before = ss_hcc_filt.sum(axis=0)

# Number of expressed genes (non-zero) per cell
n_genes_before = (ss_hcc_filt > 0).sum(axis=0)

total_counts_after = ss_hcc_norm.sum(axis=0)
n_genes_after = (ss_hcc_norm > 0).sum(axis=0)

print(f"Average total counts (before): {total_counts_before.mean():.2f}")
print(f"Average total counts (after):  {total_counts_after.mean():.2f}\n")

print(f"Average n_genes per cell (before): {n_genes_before.mean():.2f}")
print(f"Average n_genes per cell (after):  {n_genes_after.mean():.2f}")
Average total counts (before): 2095393.92
Average total counts (after):  502580.62

Average n_genes per cell (before): 10686.55
Average n_genes per cell (after):  880.40

Total counts and detected genes drop modestly after normalization, which is expected when rescaling each cell’s library size.

In [108]:
# Combine total counts
ss_hcc_total_counts = pd.DataFrame({
    'total_counts': pd.concat([total_counts_before, total_counts_after]),
    'stage': ['Before'] * len(total_counts_before) + ['After'] * len(total_counts_after)
})

# Combine gene counts
ss_hcc_n_genes = pd.DataFrame({
    'n_genes': pd.concat([n_genes_before, n_genes_after]),
    'stage': ['Before'] * len(n_genes_before) + ['After'] * len(n_genes_after)
})


plt.figure(figsize=(12, 5))

# Violin plot for total counts
plt.subplot(1, 2, 1)
sns.violinplot(data=ss_hcc_total_counts, x='stage', y='total_counts', hue='stage', palette='Set2', legend=False)
plt.title("Total Counts per Cell")
plt.xlabel("")

# Violin plot for # of genes
plt.subplot(1, 2, 2)
sns.violinplot(data=ss_hcc_n_genes, x='stage', y='n_genes', hue='stage', palette='Set2', legend=False)
plt.title("Number of Genes per Cell")
plt.xlabel("")

plt.suptitle("Before vs After Normalization", fontsize=14)
plt.tight_layout()
plt.show()
No description has been provided for this image

The “After” violins are tighter, reflecting uniform library sizes across cells.

In [109]:
print("Filtered + log2 mean variance:", np.log2(ss_hcc_filt + 1).var(axis=1).mean())
print("Normalised + log2 mean variance:", np.log2(ss_hcc_norm + 1).var(axis=1).mean())
Filtered + log2 mean variance: 3.11473854285522
Normalised + log2 mean variance: 3.076855767159178

Unlike MCF7, HCC1806 shows almost no change in mean gene variance after normalization.

Normalisation¶

We’ll compare this reconstructed matrix to the provided normalized data.

In [110]:
# Scanpy Normalisation

# Transpose the matrix to match AnnData convention: cells as rows
X = ss_hcc_filt.T  # shape: (cells × genes)

# Convert to AnnData
adata = ad.AnnData(X=X)

# Optional: name the genes and cells
adata.var_names = ss_hcc_filt.index
adata.obs_names = ss_hcc_filt.columns

# Normalise total counts per cell (default target_sum is median library size)
sc.pp.normalize_total(adata)

# Our normalized + log-transformed matrix is now in:
adata.X  # (sparse or dense depending on input)

# Convert back to DataFrame
adata_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
In [111]:
# Transpose back, so that both datasets are genes × cells
adata_df_T = adata_df.T

# Get common genes and cells
common_genes = ss_hcc_norm.index.intersection(adata_df_T.index)
common_cells = ss_hcc_norm.columns.intersection(adata_df_T.columns)

# Compute absolute difference
diff_matrix = (ss_hcc_norm.loc[common_genes, common_cells] - 
               adata_df_T.loc[common_genes, common_cells]).abs()

# 1. Mean absolute difference
mean_abs_diff = diff_matrix.mean().mean()
print(f"Mean absolute difference: {mean_abs_diff:.2f}")

# 2. Pearson correlation between flattened matrices
flat_orig = ss_hcc_norm.loc[common_genes, common_cells].values.flatten()
flat_scanpy = adata_df_T.loc[common_genes, common_cells].values.flatten()

cor = np.corrcoef(flat_orig, flat_scanpy)[0, 1]
print(f"Pearson correlation: {cor:.4f}")
Mean absolute difference: 1.38
Pearson correlation: 0.9998

Near-zero absolute mean difference and r≈1 confirm the provided matrix was normalized to median library size.

Dropped vs Retained Cells¶

In [112]:
# Quality Control for the dropped cells in the original space
dropped_cells = ss_hcc_filt.columns.difference(ss_hcc_norm.columns)
qc_metrics = pd.DataFrame({
    "total_counts": ss_hcc_filt.sum(axis=0),
    "n_genes": (ss_hcc_filt > 0).sum(axis=0)
})

qc_metrics.loc[dropped_cells].describe()
Out[112]:
total_counts n_genes
count 4.500000e+01 45.000000
mean 2.495615e+06 10189.177778
std 1.176204e+06 1117.962641
min 6.968360e+05 7558.000000
25% 1.100835e+06 9416.000000
50% 2.853875e+06 10354.000000
75% 3.351452e+06 10996.000000
max 4.858344e+06 11830.000000
In [113]:
# Quality Control for the retained cells in the original space
cells_retained = ss_hcc_norm.columns
qc_metrics.loc[cells_retained].describe()
Out[113]:
total_counts n_genes
count 1.820000e+02 182.000000
mean 1.996438e+06 10809.527473
std 1.040057e+06 957.956139
min 3.421010e+05 7268.000000
25% 1.001764e+06 10334.000000
50% 1.974784e+06 10907.500000
75% 2.791580e+06 11498.250000
max 4.774799e+06 12629.000000

Cell Filtering Outcome Summary¶

  • A total of 227 cells were initially present.
  • After normalisation + filtering, 182 cells were retained, and 45 were dropped.

Dropped Cells:¶

  • Have higher average total counts (2.5M)
  • But lower number of expressed genes (mean ≈ 10,189)

Retained Cells:¶

  • Have slightly lower total counts (2.0M)
  • But more genes detected per cell (mean ≈ 10,810)

However, the differences are not large enough to make any claims.

Now we restrict the filtered matrix to the 3000 genes in ss_mcf7_norm in order to investigate whether there are any substantial differences in number of expressed genes per cell for the dropped vs retained cells.

In [114]:
# Subset filtered matrix to just the 3000 genes in ss_mcf7_norm
genes_to_keep = ss_hcc_norm.index
cells_to_check = ss_hcc_filt.columns.difference(ss_hcc_norm.columns)

subset = ss_hcc_filt.loc[genes_to_keep, cells_to_check]

# Number of expressed genes (non-zero) per dropped cell
nonzeros_per_cell = (subset > 0).sum(axis=0)
nonzeros_per_cell.describe()
Out[114]:
count      45.000000
mean      792.533333
std       116.710715
min       549.000000
25%       720.000000
50%       778.000000
75%       847.000000
max      1104.000000
dtype: float64
In [115]:
# Retained cells
subset_retained = ss_hcc_filt.loc[genes_to_keep, cells_retained]
# Number of expressed genes (non-zero) per retained cell
nonzeros_retained = (subset_retained > 0).sum(axis=0)
nonzeros_retained.describe()
Out[115]:
count     182.000000
mean      870.175824
std       115.900805
min       549.000000
25%       793.500000
50%       872.500000
75%       944.500000
max      1169.000000
dtype: float64

There are no significant differences between the number of expressed genes per cell for dropped vs retained cells in the 3000 gene space.

It is not clear to us why exactly those 45 cells were dropped.

Dropped vs Retained Genes¶

HVG selection identifies the genes that show the most variability across cells, relative to their average expression level.

In [116]:
# Step 1: Use our normalisation
ss_hcc_norm_full = adata_df.T

# Step 2: Convert to AnnData
adata = anndata.AnnData(X=np.log1p(ss_hcc_norm_full.T.astype(float)))
adata.var_names = ss_hcc_filt.index
adata.obs_names = ss_hcc_filt.columns

# Step 3: Run HVG selection on the full gene set
sc.pp.highly_variable_genes(
    adata,
    flavor='seurat',
    n_top_genes=3000,
    inplace=True
)

# Step 4: Filter to top 3,000 genes
adata = adata[:, adata.var['highly_variable']]

# Get the 3000 HVG gene names just selected by Scanpy
hvgs_from_filt = adata.var_names

# Get the original 3000 gene names from the ss_mcf7_norm matrix
hvgs_from_norm = ss_hcc_norm.index

# Compute overlap
overlap = hvgs_from_filt.intersection(hvgs_from_norm)
print(f"Overlap: {len(overlap)} / 3000")
print(f"Percenatge Overlap: {len(overlap)/ 3000 * 100}")
Overlap: 2080 / 3000
Percenatge Overlap: 69.33333333333334

The overlap for HCC1806 (69%) is slightly worse than for MCF7 (78%), but it is still significant and we can say with high confidence that HVG selection was part of feature reduction.

DropSeq¶

In this section, we load the pre-filtered, normalized expression matrices (top 3,000 genes) generated by Drop-seq for both MCF7 and HCC1806 lines. We then:

  • peek at the first few rows of each matrix to confirm that gene IDs and normalized counts look as expected
  • check the overall dimensions to ensure we have the correct number of cells and features

This quick sanity check ensures that our Drop-seq data are correctly loaded.

In [117]:
ds_mcf7_norm = pd.read_csv("AILab2025/DropSeq/MCF7_Filtered_Normalised_3000_Data_train.txt",delimiter=" ",engine='python',index_col=0)
ds_hcc_norm = pd.read_csv("AILab2025/DropSeq/HCC1806_Filtered_Normalised_3000_Data_train.txt",delimiter=" ",engine='python',index_col=0)
In [118]:
ds_mcf7_norm.shape
Out[118]:
(3000, 21626)
In [119]:
ds_mcf7_norm.head(5)
Out[119]:
AAAAACCTATCG_Normoxia AAAACAACCCTA_Normoxia AAAACACTCTCA_Normoxia AAAACCAGGCAC_Normoxia AAAACCTAGCTC_Normoxia AAAACCTCCGGG_Normoxia AAAACTCGTTGC_Normoxia AAAAGAGCTCTC_Normoxia AAAAGCTAGGCG_Normoxia AAAATCGCATTT_Normoxia ... TTTTACAGGATC_Hypoxia TTTTACCACGTA_Hypoxia TTTTATGCTACG_Hypoxia TTTTCCAGACGC_Hypoxia TTTTCGCGCTCG_Hypoxia TTTTCGCGTAGA_Hypoxia TTTTCGTCCGCT_Hypoxia TTTTCTCCGGCT_Hypoxia TTTTGTTCAAAG_Hypoxia TTTTTTGTATGT_Hypoxia
MALAT1 1 3 3 6 4 5 1 13 3 3 ... 0 2 1 0 1 0 1 0 0 4
MT-RNR2 0 0 0 2 0 0 2 1 7 0 ... 0 0 0 0 0 0 0 0 0 0
NEAT1 0 0 0 0 0 2 0 1 2 0 ... 0 0 0 0 0 0 0 0 0 0
H1-5 0 0 0 0 0 2 0 0 0 0 ... 0 1 0 0 1 0 0 1 0 0
TFF1 4 1 1 1 0 0 0 2 0 1 ... 2 3 8 0 0 3 4 2 6 0

5 rows × 21626 columns

In [120]:
ds_mcf7_norm.describe()
Out[120]:
AAAAACCTATCG_Normoxia AAAACAACCCTA_Normoxia AAAACACTCTCA_Normoxia AAAACCAGGCAC_Normoxia AAAACCTAGCTC_Normoxia AAAACCTCCGGG_Normoxia AAAACTCGTTGC_Normoxia AAAAGAGCTCTC_Normoxia AAAAGCTAGGCG_Normoxia AAAATCGCATTT_Normoxia ... TTTTACAGGATC_Hypoxia TTTTACCACGTA_Hypoxia TTTTATGCTACG_Hypoxia TTTTCCAGACGC_Hypoxia TTTTCGCGCTCG_Hypoxia TTTTCGCGTAGA_Hypoxia TTTTCGTCCGCT_Hypoxia TTTTCTCCGGCT_Hypoxia TTTTGTTCAAAG_Hypoxia TTTTTTGTATGT_Hypoxia
count 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 ... 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000
mean 0.034000 0.030333 0.027000 0.032333 0.045333 0.047333 0.030000 0.027333 0.032000 0.027333 ... 0.052333 0.043667 0.033667 0.033000 0.025333 0.037000 0.046333 0.055667 0.038000 0.033000
std 0.277254 0.220823 0.195662 0.233751 0.246235 0.299649 0.204403 0.292030 0.281074 0.237918 ... 0.364654 0.244499 0.340449 0.302117 0.208261 0.286924 0.301469 0.358623 0.240642 0.244808
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 4.000000 4.000000 5.000000 6.000000 4.000000 8.000000 6.000000 13.000000 7.000000 6.000000 ... 7.000000 4.000000 10.000000 8.000000 6.000000 7.000000 7.000000 9.000000 6.000000 6.000000

8 rows × 21626 columns

In [121]:
ds_hcc_norm.shape
Out[121]:
(3000, 14682)
In [122]:
ds_hcc_norm.head(5)
Out[122]:
AAAAAACCCGGC_Normoxia AAAACCGGATGC_Normoxia AAAACGAGCTAG_Normoxia AAAACTTCCCCG_Normoxia AAAAGCCTACCC_Normoxia AAACACAAATCT_Normoxia AAACCAAGCCCA_Normoxia AAACCATGCACT_Normoxia AAACCTCCGGCT_Normoxia AAACGCCGGTCC_Normoxia ... TTTTCTGATGGT_Hypoxia TTTTGATTCAGA_Hypoxia TTTTGCAACTGA_Hypoxia TTTTGCCGGGCC_Hypoxia TTTTGTTAGCCT_Hypoxia TTTTTACCAATC_Hypoxia TTTTTCCGTGCA_Hypoxia TTTTTGCCTGGG_Hypoxia TTTTTGTAACAG_Hypoxia TTTTTTTGAATC_Hypoxia
H1-5 2 2 5 1 0 0 0 0 1 0 ... 0 1 0 2 1 0 0 0 3 1
MALAT1 3 3 2 3 12 3 1 2 0 0 ... 3 1 1 1 4 0 4 1 3 6
MT-RNR2 0 0 0 0 0 0 0 0 0 1 ... 1 2 2 2 0 0 1 0 1 0
ARVCF 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
BCYRN1 0 1 1 0 0 1 1 2 0 3 ... 1 1 0 1 1 0 0 1 0 0

5 rows × 14682 columns

In [123]:
ds_hcc_norm.describe()
Out[123]:
AAAAAACCCGGC_Normoxia AAAACCGGATGC_Normoxia AAAACGAGCTAG_Normoxia AAAACTTCCCCG_Normoxia AAAAGCCTACCC_Normoxia AAACACAAATCT_Normoxia AAACCAAGCCCA_Normoxia AAACCATGCACT_Normoxia AAACCTCCGGCT_Normoxia AAACGCCGGTCC_Normoxia ... TTTTCTGATGGT_Hypoxia TTTTGATTCAGA_Hypoxia TTTTGCAACTGA_Hypoxia TTTTGCCGGGCC_Hypoxia TTTTGTTAGCCT_Hypoxia TTTTTACCAATC_Hypoxia TTTTTCCGTGCA_Hypoxia TTTTTGCCTGGG_Hypoxia TTTTTGTAACAG_Hypoxia TTTTTTTGAATC_Hypoxia
count 3000.00000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.00000 3000.000000 3000.000000 ... 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000
mean 0.02900 0.041667 0.024333 0.021667 0.029667 0.020000 0.036000 0.02600 0.034000 0.029333 ... 0.043000 0.049667 0.037000 0.047667 0.057000 0.023333 0.041667 0.041667 0.043333 0.040000
std 0.23276 0.309778 0.231860 0.189409 0.323761 0.170126 0.250449 0.23525 0.231362 0.218683 ... 0.271739 0.319219 0.279864 0.259648 0.304053 0.214797 0.236536 0.285116 0.267356 0.282418
min 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 5.00000 9.000000 7.000000 4.000000 12.000000 3.000000 4.000000 6.00000 4.000000 4.000000 ... 4.000000 7.000000 7.000000 4.000000 5.000000 4.000000 4.000000 5.000000 5.000000 6.000000

8 rows × 14682 columns

We immediately notice that Drop-seq data takes much smaller values and the majority of these are zeroes.

Sequencing Technology Comparison¶

Smart-seq captures full-length transcripts with high sensitivity, enabling detection of lowly expressed genes and isoform analysis, but is limited to fewer cells (250 and 182) due to higher cost and lower throughput. Drop-seq profiles thousands of cells (21626 and 14682) by capturing only the 3′ ends of transcripts and using UMIs for quantification. While Drop-seq is more scalable, it produces sparser data with lower gene detection per cell, making it more suitable for large-scale cell population studies than detailed transcriptomic analysis.

Let us compare the proportion of zero entries in our data for Smart-seq vs Drop-seq:

In [124]:
smart_zero_prop = (ss_mcf7_norm == 0).sum().sum() / ss_mcf7_norm.size
drop_zero_prop = (ds_mcf7_norm == 0).sum().sum() / ds_mcf7_norm.size

print(f"Sparsity (Smart-seq MCF7): {smart_zero_prop:.2%}")
print(f"Sparsity (Drop-seq MCF7):  {drop_zero_prop:.2%}")
Sparsity (Smart-seq MCF7): 63.62%
Sparsity (Drop-seq MCF7):  97.53%
In [125]:
smart_zero_prop = (ss_hcc_norm == 0).sum().sum() / ss_hcc_norm.size
drop_zero_prop = (ds_hcc_norm == 0).sum().sum() / ds_hcc_norm.size

print(f"Sparsity (Smart-seq HCC1806): {smart_zero_prop:.2%}")
print(f"Sparsity (Drop-seq HCC1806):  {drop_zero_prop:.2%}")
Sparsity (Smart-seq HCC1806): 70.65%
Sparsity (Drop-seq HCC1806):  97.64%

Drop-Seq data is indeed more sparse.

Label Distribution¶

Now let us check whether the classes (hypoxia, normoxia) are balanced in these datasets:

In [126]:
hypo_count = sum('hypo' in col.lower() for col in ds_mcf7_norm.columns)
norm_count = sum('norm' in col.lower() for col in ds_mcf7_norm.columns)

print(f"Hypoxic samples: {hypo_count}")
print(f"Normoxic samples: {norm_count}")

total = hypo_count + norm_count
print(f"Hypoxic: {hypo_count/total:.2%}, Normoxic: {norm_count/total:.2%}")
Hypoxic samples: 8921
Normoxic samples: 12705
Hypoxic: 41.25%, Normoxic: 58.75%
In [127]:
hypo_count = sum('hypo' in col.lower() for col in ds_hcc_norm.columns)
norm_count = sum('norm' in col.lower() for col in ds_hcc_norm.columns)

print(f"Hypoxic samples: {hypo_count}")
print(f"Normoxic samples: {norm_count}")

total = hypo_count + norm_count
print(f"Hypoxic: {hypo_count/total:.2%}, Normoxic: {norm_count/total:.2%}")
Hypoxic samples: 8899
Normoxic samples: 5783
Hypoxic: 60.61%, Normoxic: 39.39%

Both Drop-seq datasets show moderate class imbalance. In MCF7, normoxic cells are more abundant, while in HCC1806, hypoxic cells dominate. This imbalance may influence classifier learning dynamics, potentially biasing models toward the majority class if not handled carefully.

Correlation between gene expression profiles¶

Mean pairwise cell–cell correlation is the average Pearson correlation between the expression profiles of all pairs of cells.

In [128]:
def mean_pairwise_correlation(df, method='pearson'):
    """
    Computes the mean pairwise correlation between columns.

    Parameters:
        df (pd.DataFrame): The expression matrix.
        method (str): 'pearson', 'spearman', or 'kendall'.

    Returns:
        float: Mean pairwise correlation (excluding self-correlations).
    """
    cor_matrix = df.corr(method=method)
    upper_tri_values = cor_matrix.values[np.triu_indices_from(cor_matrix, k=1)]
    return upper_tri_values.mean()
In [129]:
print("Mean pairwise cell–cell correlation")
print(f"Raw MCF7: {mean_pairwise_correlation(ss_mcf7_raw):.4f}")
print(f"Filtered MCF7: {mean_pairwise_correlation(ss_mcf7_filt):.4f}")
print(f"Normalised MCF7: {mean_pairwise_correlation(ss_mcf7_norm):.4f}")
print(f"Raw HCC1806: {mean_pairwise_correlation(ss_hcc_raw):.4f}")
print(f"Filetered HCC1806: {mean_pairwise_correlation(ss_hcc_filt):.4f}")
print(f"Normalised HCC1806: {mean_pairwise_correlation(ss_hcc_norm):.4f}")
Mean pairwise cell–cell correlation
Raw MCF7: 0.6720
Filtered MCF7: 0.7211
Normalised MCF7: 0.6654
Raw HCC1806: 0.7405
Filetered HCC1806: 0.7985
Normalised HCC1806: 0.7480

In both MCF7 and HCC1806, going from raw to filtered increases correlation. This indicates:

  • Low-quality or noisy cells were successfully removed
  • The retained cells share more biologically consistent expression profiles

After normalization and HVG selection, the mean pairwise correlation between cells drops slightly in both cell lines. This reflects two effects: normalization removes global differences in sequencing depth, reducing technical variability; meanwhile, highly variable gene (HVG) filtering discards stable, housekeeping genes and retains genes that emphasize biological differences between cells. Together, these steps increase the biological resolution of the data but can reduce overall similarity across cells.

Nevertheless, all versions of the MCF7 and HCC1806 Smart-seq data (raw, filtered, and filtered + normalized) exhibit strong mean pairwise cell–cell correlations (~0.6–0.8). This confirms good internal consistency (cohesive cell populations, low noise) across samples and suggests that preprocessing steps effectively preserved the underlying biological structure of the data.

Unsupervised Learning¶

PCA¶

To distill the high-dimensional single-cell expression matrices into their most informative axes, we apply PCA across all four dataset–condition combinations. Our goals are:

  • Variance capture: Determine the smallest set of PCs explaining ≥95% of the total variance, thereby retaining the bulk of biological signal while discarding noise and redundancy for downstream modeling.
  • Preprocessing impact: Compare unscaled vs. unit-variance-scaled PCA to understand how per-gene normalization redistributes variance and affects our ability to separate hypoxia vs. normoxia states.

Unit-variance scaling equalizes gene variances - potentially up-weighting technical noise from lowly expressed genes - whereas unscaled PCA preserves the raw variance structure, emphasizing dominant biological patterns. By quantifying both the cumulative‐variance and the classification power of leading PCs in each regime, we assess the robustness of hypoxia signals and provide practical guidance for preprocessing choices in single-cell analyses.

Helper functions¶

This cell defines a set of helper functions for streamlined PCA-based analysis and diagnostic evaluation:

  1. get_top_components

    • Runs PCA on an AnnData object
    • Plots a scree plot of explained variance
    • Computes and stores the number of PCs required to explain 95% of variance
    • Saves both the count and the reduced coordinates in adata
  2. add_labels

    • Parses cell names in adata.obs_names
    • Adds a binary condition column (“Hypo” vs. “Norm”) for later evaluation
  3. best_linear_pc_split

    • Performs pairwise logistic‐regression on the first max_pc PCs
    • Uses cross-validation to find which two PCs best separate “Hypo” vs. “Norm”
  4. best_split

    • Wraps best_linear_pc_split
    • Prints the best PC pair and CV score
    • Generates a scatterplot of cells on those two PCs, colored by condition

These functions allow us to run unsupervised PCA and perform a supervised check to evaluate how our top components capture the hypoxia vs. normoxia signal.

In [130]:
warnings.filterwarnings('ignore')

def get_top_components(adata, n_pcs, plot=True):
    """this function runs PCA and returns the number of components needed to explain 95% of the variance
    Parameters: 
    adata: AnnData object
    n_pcs: number of principal components to compute
    plot: whether to plot the variance explained per PC (scree plot)
    Returns:
    n_components_95: number of components needed to explain 95% of the variance
    Additionally it adds the following to the adata object:
    adata.uns['pca']['n_components_95']: number of components needed to explain 95% of the variance
    adata.obsm['X_pca_95']: PCA coordinates for the first n_components_95
    """
    # plot variance explained per PC (scree plot)
    if plot:
        sc.pl.pca_variance_ratio(adata, n_pcs=n_pcs, log=False)

    # access the variance ratios
    explained_var = adata.uns['pca']['variance_ratio']  # array of variance explained per PC

    # cmpute cumulative sum
    cumulative_var = np.cumsum(explained_var)
    
    # find the number of PCs needed to reach 95% variance
    n_components_95 = np.argmax(cumulative_var >= 0.95) + 1  # +1 because np.argmax is 0-based
    print(f"Number of PCs needed to explain 95% variance: {n_components_95}")

    # add to the adata object this information
    adata.uns['pca']['n_components_95'] = n_components_95
    adata.obsm['X_pca_95'] = adata.obsm['X_pca'][:, :n_components_95]

    for i, var in enumerate(explained_var[:10]):
        print(f"PC{i+1} explains: {var*100:.2f}%")
    return n_components_95

def add_labels(adata):
    """this function will add labels to the adata object based on the cell names
    """
    # add labels based on cell names
    adata.obs['condition'] = ['Hypo' if 'hypo' in name.lower() else 'Norm' for name in adata.obs_names]

def best_linear_pc_split(adata, label_key='condition', max_pc=20):
    """
    Finds the best pair of principal components (among first `max_pc`) for linearly separating the data.

    Parameters:
    - adata: AnnData object with PCA computed (`adata.obsm['X_pca']` must exist).
    - label_key: column in `adata.obs` used as target for classification.
    - max_pc: maximum number of PCs to consider (default=20).

    Returns:
    - best_pair: tuple (pc1, pc2) with 1-based PC indices.
    - best_score: mean cross-validation accuracy.
    """

    if 'X_pca' not in adata.obsm:
        raise ValueError("Run sc.tl.pca(adata) first to compute PCA")

    X_pca = adata.obsm['X_pca'][:, :max_pc]
    y = adata.obs[label_key].values

    best_pair = None
    best_score = -np.inf

    for i, j in combinations(range(max_pc), 2):
        X_pair = X_pca[:, [i, j]]
        clf = LogisticRegression(max_iter=1000)
        score = cross_val_score(clf, X_pair, y, cv=5).mean()

        if score > best_score:
            best_score = score
            best_pair = (i + 1, j + 1)  # return 1-based PC indices

    return best_pair, best_score

def best_split(adata, label_key='condition', max_pc=20):
    """draws the best split plot"""
    best_pair, best_score = best_linear_pc_split(adata, label_key=label_key, max_pc=max_pc)
    print(f"Best pair of PCs: {best_pair} with score: {best_score:.4f}")
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x=adata.obsm['X_pca'][:, best_pair[0] - 1],
                    y=adata.obsm['X_pca'][:, best_pair[1] - 1],
                    hue=adata.obs[label_key],
                    palette='Set2',
                    s = 10)
    plt.title(f"Best pair of PCs: {best_pair} with score: {best_score:.4f}")
    plt.xlabel(f"PC{best_pair[0]}")
    plt.ylabel(f"PC{best_pair[1]}")
    plt.legend(title='condition')
    plt.show()

SmartSeq¶

SmartSeq MCF7¶

We start with the non-scaled PCA:

In [131]:
X = ss_mcf7_norm.T               # cells × genes
adata_ss_mcf7 = ad.AnnData(X) 
adata_ss_mcf7.obs_names = ss_mcf7_norm.columns      # adding cell names
adata_ss_mcf7.var_names = ss_mcf7_norm.index        # adding gene names

add_labels(adata_ss_mcf7)           # adding condition labels
sc.pp.pca(adata_ss_mcf7, n_comps=50, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ss_mcf7, n_pcs=50, plot=True) 

# plot PCA
best_split(adata_ss_mcf7, label_key='condition', max_pc=10)
No description has been provided for this image
Number of PCs needed to explain 95% variance: 20
PC1 explains: 63.45%
PC2 explains: 9.11%
PC3 explains: 6.27%
PC4 explains: 4.03%
PC5 explains: 3.16%
PC6 explains: 1.54%
PC7 explains: 1.14%
PC8 explains: 1.01%
PC9 explains: 0.90%
PC10 explains: 0.76%
Best pair of PCs: (1, 6) with score: 0.9920
No description has been provided for this image

The first 20 principal components capture ≥95% of the total variance in the untransformed, unscaled data. The best 2-PC split is PC1 vs PC6 with 0.992 accuracy - these two axes yield nearly perfect linear separation of ‘Hypo’ vs ‘Norm’ cells. The unscaled data are strongly dominated by PC1 (over 60% variance), suggesting a single major gradient separates the samples. Yet PC6 adds the extra discriminative power needed for clean classification.

Now let's look at the scaled option:

In [132]:
X_log = np.log1p(ss_mcf7_norm.T)        # cells × genes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log)       # scaling the data
adata_ss_mcf7_scaled = ad.AnnData(X_log)
adata_ss_mcf7_scaled.obs_names = ss_mcf7_norm.columns       # adding cell names
adata_ss_mcf7_scaled.var_names = ss_mcf7_norm.index         # adding gene names

add_labels(adata_ss_mcf7_scaled) # adding condition labels
sc.pp.pca(adata_ss_mcf7_scaled, n_comps=249, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ss_mcf7_scaled, n_pcs=249, plot=True)

# plot PCA 
best_split(adata_ss_mcf7_scaled, label_key='condition', max_pc=50)
No description has been provided for this image
Number of PCs needed to explain 95% variance: 204
PC1 explains: 18.64%
PC2 explains: 5.55%
PC3 explains: 4.02%
PC4 explains: 2.24%
PC5 explains: 1.77%
PC6 explains: 1.47%
PC7 explains: 1.26%
PC8 explains: 1.16%
PC9 explains: 1.04%
PC10 explains: 0.97%
Best pair of PCs: (1, 2) with score: 0.9920
No description has been provided for this image

Enforcing unit variance per gene spreads the variance more evenly across many components, leaving us with 204 PCs for explaining >95% of variance. The best 2-PC split is achieved with PC1 and PC2 and yields 0.992 accuracy (same as unscaled one), giving us nearly perfect separation. In short, scaling dramatically reduces the dominance of PC1 (down from 63% to 19%) and distributes signal into higher PCs. As a result, the “best” discriminative axes shift - PC2 (rather than PC6) now carries enough of the hypoxia signal to pair with PC1.

SmartSeq HCC1806¶

Similarly, we begin with non-scaled:

In [133]:
data = ss_hcc_norm

X = data.T          # cells × genes
adata_ss_hcc = ad.AnnData(X) 
adata_ss_hcc.obs_names = data.columns          # adding cell names
adata_ss_hcc.var_names = data.index            # adding gene names

add_labels(adata_ss_hcc)        # adding condition labels
sc.pp.pca(adata_ss_hcc, n_comps=50, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ss_hcc, n_pcs=50, plot=True)  

# plot PCA
best_split(adata_ss_hcc, label_key='condition', max_pc=10)
No description has been provided for this image
Number of PCs needed to explain 95% variance: 34
PC1 explains: 29.02%
PC2 explains: 18.10%
PC3 explains: 12.29%
PC4 explains: 7.97%
PC5 explains: 4.96%
PC6 explains: 3.64%
PC7 explains: 2.74%
PC8 explains: 2.11%
PC9 explains: 1.74%
PC10 explains: 1.32%
Best pair of PCs: (2, 3) with score: 0.9450
No description has been provided for this image

The first 34 components capture ≥95% of the raw data’s variance. Best 2-PC split is achieved with PC2 and PC3 and 0.945 accuracy. PCs 2 & 3 together yield a clear, though slightly less perfect, separation of Hypo vs Norm cells compared to the MCF7 line. Here variance is less dominated by PC1 (29% vs. 63% before), and PCs 2 & 3 carry strong hypoxia signals. This suggests HCC1806 biology is more multifaceted: multiple axes beyond the first contribute meaningfully to condition differences.

Now the scaled option:

In [134]:
data = ss_hcc_norm
scaler = StandardScaler()
X_log = np.log1p(data.T)        # cells × genes
X_scaled = scaler.fit_transform(X_log)          # scaling the data
adata_ss_hcc_scaled = ad.AnnData(X_scaled)
adata_ss_hcc_scaled.obs_names = data.columns        # adding cell names
adata_ss_hcc_scaled.var_names = data.index       # adding gene names

add_labels(adata_ss_hcc_scaled)      # adding condition labels
sc.pp.pca(adata_ss_hcc_scaled, n_comps=181, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ss_hcc_scaled, n_pcs=181)

# plot PCA 
best_split(adata_ss_hcc_scaled, label_key='condition', max_pc=10)
No description has been provided for this image
Number of PCs needed to explain 95% variance: 161
PC1 explains: 4.50%
PC2 explains: 3.23%
PC3 explains: 2.56%
PC4 explains: 1.97%
PC5 explains: 1.55%
PC6 explains: 1.46%
PC7 explains: 1.31%
PC8 explains: 1.21%
PC9 explains: 1.07%
PC10 explains: 0.99%
Best pair of PCs: (2, 3) with score: 0.9012
No description has been provided for this image

Number of PCs for 95% variance is now 161, which means that equalizing per-gene variance scatters signal across many more dimensions.Even after scaling, PCs 2 & 3 remain the optimal discriminators, though accuracy drops modestly (from 0.945 to 0.9012).

DropSeq¶

DropSeq MCF7¶

The non-scaled analysis:

In [135]:
data = ds_mcf7_norm

X = data.T          # cells × genes
print(X.shape)
adata_ds_mcf7 = ad.AnnData(X) 
adata_ds_mcf7.obs_names = data.columns      # adding cell names
adata_ds_mcf7.var_names = data.index        # adding gene names

add_labels(adata_ds_mcf7)       # adding condition labels
sc.pp.pca(adata_ds_mcf7, n_comps=800, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ds_mcf7, n_pcs=800, plot=True)  
# plot PCA
best_split(adata_ds_mcf7, label_key='condition', max_pc=10)
(21626, 3000)
No description has been provided for this image
Number of PCs needed to explain 95% variance: 761
PC1 explains: 24.95%
PC2 explains: 8.63%
PC3 explains: 4.43%
PC4 explains: 2.63%
PC5 explains: 2.13%
PC6 explains: 1.42%
PC7 explains: 1.30%
PC8 explains: 1.02%
PC9 explains: 0.92%
PC10 explains: 0.85%
Best pair of PCs: (2, 3) with score: 0.8690
No description has been provided for this image

A very large number of components is required (761), because the raw DropSeq matrix is sparse and high-dimensional. PC2 and PC3 together give a decent but noisier separation compared to SmartSeq, with also a lower accuracy 0.869.

Now the scaled one:

In [136]:
data = ds_mcf7_norm
X_log = np.log1p(data.T)        # cells × genes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log)          # scaling the data
adata_ds_mcf7_scaled = ad.AnnData(X_scaled)
adata_ds_mcf7_scaled.obs_names = data.columns       # adding cell names
adata_ds_mcf7_scaled.var_names = data.index         # adding gene names

add_labels(adata_ds_mcf7_scaled)        # adding condition labels
sc.pp.pca(adata_ds_mcf7_scaled, n_comps=2999, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ds_mcf7_scaled, n_pcs=2999)

# plot PCA
best_split(adata_ds_mcf7_scaled, label_key='condition', max_pc=10)
No description has been provided for this image
Number of PCs needed to explain 95% variance: 2652
PC1 explains: 0.73%
PC2 explains: 0.29%
PC3 explains: 0.17%
PC4 explains: 0.12%
PC5 explains: 0.09%
PC6 explains: 0.09%
PC7 explains: 0.08%
PC8 explains: 0.08%
PC9 explains: 0.07%
PC10 explains: 0.07%
Best pair of PCs: (1, 3) with score: 0.9681
No description has been provided for this image

Unit-variance scaling massively flattens the variance curve, so almost every PC contributes a tiny amount, resulting in needing 2652 PCs to explain >95% variance. Separability, however, improves slightly after scaling, likely because it down-weights very high-variance (noisy) genes: the 2-PC best split PCs are PC1 & PC3 with accuracy 0.9681. In this run, PC1 carries a lot of the hypoxia signal (so pairing it with PC3 gives a cleaner separation than PC2 ever did).

DropSeq HCC1806¶

Last set of cells, we begin with non-scaled PCA:

In [137]:
data = ds_hcc_norm

X = data.T         # cells × genes
adata_ds_hcc = ad.AnnData(X) 
adata_ds_hcc.obs_names = data.columns       # adding cell names
adata_ds_hcc.var_names = data.index         # adding gene names

add_labels(adata_ds_hcc)        # adding condition labels
sc.pp.pca(adata_ds_hcc, n_comps=900, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ds_hcc, n_pcs=900, plot=True)  

# plot PCA
best_split(adata_ds_hcc, label_key='condition', max_pc=10)
No description has been provided for this image
Number of PCs needed to explain 95% variance: 844
PC1 explains: 8.42%
PC2 explains: 6.32%
PC3 explains: 3.88%
PC4 explains: 2.87%
PC5 explains: 2.38%
PC6 explains: 2.16%
PC7 explains: 1.67%
PC8 explains: 1.48%
PC9 explains: 1.44%
PC10 explains: 1.25%
Best pair of PCs: (5, 6) with score: 0.8087
No description has been provided for this image

The raw data require 844 components to reach 95% cumulative variance, reflecting widespread noise. From the 2-PC split we got 0.8087 accuracy: PCs 5 & 6 yield the clearest linear separation under no scaling, showing that subtle higher-order components contain the hypoxia signal.

Continue with scaled:

In [138]:
data = ds_hcc_norm
X_log = np.log1p(data.T)    # cells × genes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log)       # scaling the data
adata_ds_hcc_scaled = ad.AnnData(X_scaled) 
adata_ds_hcc_scaled.obs_names = data.columns           # adding cell names
adata_ds_hcc_scaled.var_names = data.index             # adding gene names

add_labels(adata_ds_hcc_scaled)         # adding condition labels
sc.pp.pca(adata_ds_hcc_scaled, n_comps=2999, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ds_hcc_scaled, n_pcs=2999, plot=True)  

# plot PCA
best_split(adata_ds_hcc_scaled, label_key='condition', max_pc=10)
No description has been provided for this image
Number of PCs needed to explain 95% variance: 2595
PC1 explains: 0.39%
PC2 explains: 0.27%
PC3 explains: 0.17%
PC4 explains: 0.15%
PC5 explains: 0.09%
PC6 explains: 0.09%
PC7 explains: 0.08%
PC8 explains: 0.08%
PC9 explains: 0.08%
PC10 explains: 0.08%
Best pair of PCs: (3, 4) with score: 0.8932
No description has been provided for this image

Unit-variance scaling flattens the variance curve almost completely, so nearly every PC contributes a sliver. This re-weighting sharpens separation (≈89% accuracy) compared to the unscaled case, giving the best 2-PC split with PC3 & PC4.

DropSeq issue¶

When we ran PCA on the full (Drop-seq) dataset and asked for enough components to cover 95% of the variance, the algorithm returned thousands of PCs. In practice, those later dimensions:

  • Capture very low signal-to-noise ratios (technical noise, drop-out events)
  • Tend to drown out the biologically meaningful structure when fed into UMAP or t-SNE
  • Greatly increase computation time and destabilize embeddings

Instead of a blanket “95% variance” cutoff, we’ll now use the scree plot (per-PC variance curve) to pick the elbow point—the PC after which each additional axis contributes only vanishing gains. This approach:

  1. Denoises by discarding the long tail of tiny, likely artifactual components
  2. Speeds up UMAP/t-SNE and yields more reproducible layouts
  3. Focuses on the axes that capture true biological variation (cell-state differences, treatment effects)

Below we generate the scree & cumulative variance plots, detect the elbow and run our new PCA before proceeding to UMAP/t-SNE on those top components.

In [139]:
sc.tl.pca(adata_ds_mcf7_scaled, n_comps=200, svd_solver='full')

evr    = adata_ds_mcf7_scaled.uns['pca']['variance_ratio']        # shape
cumvar = np.cumsum(evr)                           # cumulative sum
pcs    = np.arange(1, len(evr) + 1)               # [1,2,3,...]

fig, ax1 = plt.subplots(figsize=(6,4))
ax1.plot(pcs, evr,    '-o', label='per‐PC var.')
ax1.set_xlabel('PC number');   ax1.set_ylabel('Explained variance ratio')
ax1.axvline(20, color='gray', linestyle='--', alpha=0.5)

ax2 = ax1.twinx()
ax2.plot(pcs, cumvar, '-s', c='C1', label='cumulative var.')
ax2.set_ylabel('Cumulative variance')

ax1.legend(loc='upper left');  ax2.legend(loc='lower right')
plt.title('Scree & cumulative variance')
plt.show()
No description has been provided for this image
In [140]:
sc.tl.pca(adata_ds_hcc_scaled, n_comps=200, svd_solver='full')

evr    = adata_ds_hcc_scaled.uns['pca']['variance_ratio']        # shape
cumvar = np.cumsum(evr)                           # cumulative sum
pcs    = np.arange(1, len(evr) + 1)               # [1,2,3,...]

fig, ax1 = plt.subplots(figsize=(6,4))
ax1.plot(pcs, evr,    '-o', label='per‐PC var.')
ax1.set_xlabel('PC number');   ax1.set_ylabel('Explained variance ratio')
ax1.axvline(20, color='gray', linestyle='--', alpha=0.5)

ax2 = ax1.twinx()
ax2.plot(pcs, cumvar, '-s', c='C1', label='cumulative var.')
ax2.set_ylabel('Cumulative variance')

ax1.legend(loc='upper left');  ax2.legend(loc='lower right')
plt.title('Scree & cumulative variance')
plt.show()
No description has been provided for this image

From our scree‐plots:

  • MCF-7: the per-PC variance curve flattens out around PC 4–5
  • HCC1806: the elbow appears near PC 5–6

To remain conservative (i.e. not risk dropping any real biological signal) and keep our workflow consistent across both datasets, we will use 10 PCs for all downstream steps (UMAP, t-SNE, clustering). This ensures:

  1. Coverage of all clearly informative axes (including a small safety buffer beyond the elbow)
  2. Robustness against dataset-specific noise peaks
  3. Cohesion in parameter choice across MCF-7 and HCC1806

Below, we re-run PCA with n_comps=10 and proceed to UMAP/t-SNE on those top 10 components.

In [141]:
data = ds_mcf7_norm
X_log = np.log1p(data.T)        # cells × genes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log)          # scaling the data
adata_ds_mcf7_scaled = ad.AnnData(X_scaled)
adata_ds_mcf7_scaled.obs_names = data.columns       # adding cell names
adata_ds_mcf7_scaled.var_names = data.index         # adding gene names

add_labels(adata_ds_mcf7_scaled)        # adding condition labels
sc.pp.pca(adata_ds_mcf7_scaled, n_comps=10, svd_solver='arpack', use_highly_variable=False)
In [142]:
data = ds_hcc_norm
X_log = np.log1p(data.T)    # cells × genes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log)       # scaling the data
adata_ds_hcc_scaled = ad.AnnData(X_scaled) 
adata_ds_hcc_scaled.obs_names = data.columns           # adding cell names
adata_ds_hcc_scaled.var_names = data.index             # adding gene names

add_labels(adata_ds_hcc_scaled)         # adding condition labels
sc.pp.pca(adata_ds_hcc_scaled, n_comps=10, svd_solver='arpack', use_highly_variable=False)

Results summary¶

Empirically, we compared PCA with and without scaling across four dataset‐condition combinations (SmartSeq vs. DropSeq, MCF7 vs. HCC1806) and found that, despite concerns over unit‐variance scaling amplifying technical noise, the unscaled PCA still captures hypoxia‐related variation robustly in its leading components. Overall, scaling tends to:

  • Spread variance more evenly across many components (e.g. SmartSeq MCF7 rises from 20 → 204 PCs for 95% variance).
  • Shift the optimal separation axes (e.g. MCF7 switches from PC(1,6) unscaled to PC(1,2) scaled).
  • Modestly affect classification accuracy: in the noisier DropSeq datasets, scaling actually improves separability (MCF7: 0.87 → 0.90; HCC1806: 0.81 → 0.89), whereas in SmartSeq lines accuracy remains near‐perfect (0.99).
In [143]:
summary_df = pd.DataFrame([
    {"Dataset": "SmartSeq MCF7",     "Scaling": "Unscaled", "PCs for 95% var": 20,   "Best PC pair": "1,6", "Accuracy": 0.992},
    {"Dataset": "SmartSeq MCF7",     "Scaling": "Scaled",   "PCs for 95% var": 204,  "Best PC pair": "1,2", "Accuracy": 0.992},
    {"Dataset": "SmartSeq HCC1806",  "Scaling": "Unscaled", "PCs for 95% var": 34,   "Best PC pair": "2,3", "Accuracy": 0.945},
    {"Dataset": "SmartSeq HCC1806",  "Scaling": "Scaled",   "PCs for 95% var": 161,  "Best PC pair": "2,3", "Accuracy": 0.901},
    {"Dataset": "DropSeq MCF7",      "Scaling": "Unscaled", "PCs for 95% var": 761,  "Best PC pair": "2,3", "Accuracy": 0.869},
    {"Dataset": "DropSeq MCF7",      "Scaling": "Scaled",   "PCs for 95% var": 2652, "Best PC pair": "2,3", "Accuracy": 0.901},
    {"Dataset": "DropSeq HCC1806",   "Scaling": "Unscaled", "PCs for 95% var": 844,  "Best PC pair": "5,6", "Accuracy": 0.809},
    {"Dataset": "DropSeq HCC1806",   "Scaling": "Scaled",   "PCs for 95% var": 2595, "Best PC pair": "3,4", "Accuracy": 0.893},
])

summary_df
Out[143]:
Dataset Scaling PCs for 95% var Best PC pair Accuracy
0 SmartSeq MCF7 Unscaled 20 1,6 0.992
1 SmartSeq MCF7 Scaled 204 1,2 0.992
2 SmartSeq HCC1806 Unscaled 34 2,3 0.945
3 SmartSeq HCC1806 Scaled 161 2,3 0.901
4 DropSeq MCF7 Unscaled 761 2,3 0.869
5 DropSeq MCF7 Scaled 2652 2,3 0.901
6 DropSeq HCC1806 Unscaled 844 5,6 0.809
7 DropSeq HCC1806 Scaled 2595 3,4 0.893

Across all conditions, the hypoxia signature consistently emerges in the first handful of unscaled PCs - even in sparse DropSeq - confirming that PCA alone can recover our phenotype without fancy normalization. Scaling can unearth subtler axes in noisy data, but at the cost of inflating minor technical variance. Analizing also the DropSeq issue we encountered, we can safely impose using only 10 PCs for the downstream analysis of those two datasets.

Data & Parameter Choices for Downstream Analysis

  • Smart-seq
    We will proceed unscaled (no zero-centring/unit-variance) since this data was already normalized and yields robust results without additional scaling.

  • Drop-seq
    We will use StandardScaler (zero-centering and unit-variance scaling) followed by PCA with n_comps=10, as described above, to denoise and stabilize our embeddings.

K-NN graph¶

Building a k-NN graph is a necessary preprocessing step for graph-based dimensionality reduction methods like UMAP and t-SNE, which leverage local cell neighborhoods.

We use Scanpy’s pp.neighbors() function and focus on three key parameters:

  • n_pcs: Number of principal components from PCA used as the embedding space for neighbor calculations. We will not pass it directly, rather accessing the X_pca_95 obsm wrapped in the adata object for the Smart Seq Datasets, and X_pca for Drop Seq ones.

  • n_neighbors: Number of nearest neighbors per cell to include in the graph. A common heuristic is to set
    $$n\_neighbors \approx \sqrt{N},$$
    where (N) is the total number of cells in the dataset .

  • metric: Distance metric for computing pairwise cell distances in PC space, Euclidean is the standard and the one we will use in our implementation.

functions¶

In [144]:
# printing the shape of the datasets to look at the number of cells and genes
print(ds_hcc_norm.shape,
ds_mcf7_norm.shape,
ss_hcc_norm.shape,
ss_mcf7_norm.shape)
(3000, 14682) (3000, 21626) (3000, 182) (3000, 250)
In [145]:
def build_and_diagnose_knn(
    adata,
    n_neighbors,
    metric="euclidean",
    use_rep="X_pca_95",
    random_state=42
):
    """
    Build a k-NN graph on adata.obsm['X_pca'] and then produce:
      1) degree‐distribution histogram
      2) adjacency heatmap for a random subset of cells
      3) spring‐layout network plot for that subset

    Parameters
    ----------
    adata : AnnData
        Must have PCA in adata.obsm['X_pca'].
    n_neighbors : int
        k for the k-NN graph.
    metric : str
        Distance metric for k-NN.
    random_state : int
        Seed for reproducibility.
    Returns
    -------
    None
    Additionally it adds the following to the adata object:
    adata.uns['neighbors']['degree_distribution']: degree distribution of the k-NN graph
    adata.obsm["connectivities"]: connectivity matrix of the k-NN graph
    """
    if use_rep is None:
        use_rep = "X_pca_95"  # default representation for k-NN graph
    # Check if elbow point is already computed

    # 1) build the graph
    sc.pp.neighbors(
        adata,
        n_neighbors=n_neighbors,
        metric='euclidean',  
        method='umap',
        knn=True,
        use_rep=f"{use_rep}",
        random_state=random_state
    )

MCF7 Smart Seq¶

In [146]:
build_and_diagnose_knn(
    adata_ss_mcf7,
    n_neighbors=int(np.sqrt(250)), # sqrt(250) = 15.81
    metric="euclidean",
    random_state=42
)

HCC1806 Smart Seq¶

In [147]:
build_and_diagnose_knn(
    adata_ss_hcc,
    n_neighbors=int(np.sqrt(ss_hcc_norm.shape[1])), #sqrt(181) = 13.45
    metric="euclidean",
    random_state=42
)

MCF7 Drop Seq¶

In [148]:
build_and_diagnose_knn(
    adata_ds_mcf7_scaled,
    n_neighbors=int(np.sqrt(ds_mcf7_norm.shape[1])), # sqrt(21626) = 147.0
    metric="euclidean",
    use_rep="X_pca",
    random_state=42
)

HCC11806 Dropseq¶

In [149]:
build_and_diagnose_knn(
    adata_ds_hcc_scaled,
    n_neighbors=int(np.sqrt(ds_hcc_norm.shape[1])), # sqrt(14682) = 121.2
    use_rep="X_pca",
    metric="euclidean",
    random_state=42
)

t-SNE¶

t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction method. It is used only for visualization, not for training models. It projects high-dimensional data (e.g. 3000 genes) into 2 or 3 dimensions while preserving local structure — meaning that similar cells stay close together. It works by converting pairwise similarities into probabilities and minimizing the Kullback–Leibler divergence between high and low-dimensional distributions.

In our experiments, t-SNE reveals strong separation in datasets, with hypoxic and normoxic cells forming distinct clusters.

t-SNE’s effectiveness is highly sensitive to its parameters. Perplexity controls the balance between local and global aspects of the data (similar to the number of nearest neighbors considered), and different values can yield drastically different embeddings. As a standard choice we used 30 but then tried different parameter for every dataset in order to get the best visual split.

functions¶

In [150]:
def t_sne(adata, title="", perplexity=30, use_rep="X_pca_95", random_state=42):
    """
    Run t-SNE on the PCA-reduced data and plot the results.
    Parameters
    ----------
    adata : AnnData
        The AnnData object containing PCA-reduced data.
    title : str
        Title for the plot.
    perplexity : int
        Perplexity parameter for t-SNE.
    Returns
    -------
    adata : AnnData
        The AnnData object with t-SNE coordinates added.
    """
    sc.tl.tsne(
        adata,
        use_rep=use_rep,
        perplexity=perplexity,
        n_pcs=adata.uns["pca"]["n_components_95"] if use_rep == "X_pca_95" else None,
        random_state=42)

    sc.pl.tsne(
        adata,
        color='condition',
        show=False,
        size=10,
        title=f"t-SNE: {title}"
    )
    return adata

MCF7 Smart Seq¶

The t-SNE projection of the raw (unscaled) MCF7 Smart-Seq dataset reveals two distinct regions in the embedding space, corresponding to cells under hypoxia and normoxia conditions. Notably, two cells appear closer to the normoxia region, deviating slightly from the expected separation.

In [151]:
t_sne(adata_ss_mcf7, title="MCF7 Smart Seq unscaled", perplexity=50)
Out[151]:
AnnData object with n_obs × n_vars = 250 × 3000
    obs: 'condition'
    uns: 'pca', 'neighbors', 'tsne', 'condition_colors'
    obsm: 'X_pca', 'X_pca_95', 'X_tsne'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
No description has been provided for this image

HCC1806 Smart Seq¶

Also for HCC, two regions seem to emerge, though they are not as distinct or well-defined as in the other cell line. Notably, we had to manually adjust the perplexity parameter to 20(instead of standard 30) to achieve a visually interpretable result.

In [152]:
t_sne(adata_ss_hcc, title="HCC1806 Smart Seq unscaled", perplexity=20)
Out[152]:
AnnData object with n_obs × n_vars = 182 × 3000
    obs: 'condition'
    uns: 'pca', 'neighbors', 'tsne', 'condition_colors'
    obsm: 'X_pca', 'X_pca_95', 'X_tsne'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
No description has been provided for this image

MCF7 Drop-Seq¶

The effect of preprocessing (log-transforming and scaling the dataset) and reducing the number of principal components to just 10 revealed two distinct subgroups in the t-SNE embedding, corresponding to the biological conditions. This improvement is due to t-SNE's sensitivity to the number of input dimensions; using more than 1000 dimensions would result in a noisy and uninterpretable plot.

In [153]:
t_sne(adata_ds_mcf7_scaled, title="MCF7 Drop Seq scaled - using only 10 components", perplexity=50, use_rep="X_pca", random_state=42)
Out[153]:
AnnData object with n_obs × n_vars = 21626 × 3000
    obs: 'condition'
    uns: 'pca', 'neighbors', 'tsne', 'condition_colors'
    obsm: 'X_pca', 'X_tsne'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
No description has been provided for this image

HCC Drop-Seq¶

The t-SNE embedding of the scaled HCC Drop-Seq dataset reveals two distinct subgroups corresponding to the biological conditions. For the same reason as before, this separation is achieved only after preprocessing the data (log-transforming and scaling) and reducing the number of principal components to 10.

In [154]:
t_sne(adata_ds_hcc_scaled, title="HCC1806 Drop Seq scaled", perplexity=50, use_rep="X_pca")
Out[154]:
AnnData object with n_obs × n_vars = 14682 × 3000
    obs: 'condition'
    uns: 'pca', 'neighbors', 'tsne', 'condition_colors'
    obsm: 'X_pca', 'X_tsne'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
No description has been provided for this image

UMAP¶

UMAP (Uniform Manifold Approximation and Projection) is a powerful non-linear dimensionality reduction technique that captures both local neighborhood structure and global data topology. In our single-cell analysis, we use UMAP to embed high-dimensional gene expression profiles into two dimensions, making it easier to visualize and interpret cell clusters.

To compute the UMAP embedding using the scanpy function sc.tl.umap, we need first to have a k-nn graph that determines n_neighbors, defining how many neighbors each cell considers when building the graph. This cell is meant to be run after the K-NN one.

The min_dist parameter controls how tightly UMAP packs points together in the low-dimensional space. The default value of 0.5 provides a good representation of the conditions in the embedded space, so we left this stadnard choice.

MCF7 SmartSeq

The UMAP embedding closely mirrors the patterns observed in the t-SNE analysis. Using the unscaled data, there is a near-perfect separation into two distinct regions, except for two points, which aligns exactly with the t-SNE results.

In [155]:
sc.tl.umap(
    adata_ss_mcf7,
    min_dist=0.5,
    random_state=42
)
sc.pl.umap(
    adata_ss_mcf7,
    color='condition',
    show=False,
    size=20,
    title="UMAP: MCF7 Smart-seq unscaled"
)
Out[155]:
<Axes: title={'center': 'UMAP: MCF7 Smart-seq unscaled'}, xlabel='UMAP1', ylabel='UMAP2'>
No description has been provided for this image

HCC SmartSeq¶

Consistent with the PCA and t-SNE results, the visual separation is less pronounced compared to the MCF7 cell line. However, a clear gradient is still observable, indicating some level of differentiation between conditions.

In [156]:
sc.tl.umap(
    adata_ss_hcc,
    min_dist=0.5,
    random_state=42
)
sc.pl.umap(
    adata_ss_hcc,
    color='condition',
    show=False,
    size=20,
    title="UMAP: HCC1806 Smart-seq unscaled"
)
Out[156]:
<Axes: title={'center': 'UMAP: HCC1806 Smart-seq unscaled'}, xlabel='UMAP1', ylabel='UMAP2'>
No description has been provided for this image

MCF7 DropSeq¶

UMAP embedding reveals interesting structures in the preprocessed version of the data. Notably, UMAP has a superior ability to preserve the global and local structure of the data, even when working with a high number of components. To maintain consistency with the neighbors graph constructed earlier, the analysis here is still based on the first 10 components.

In [157]:
#scaled version
sc.tl.umap(
    adata_ds_mcf7_scaled,
    min_dist=0.5,
    random_state=42
)
sc.pl.umap(
    adata_ds_mcf7_scaled,
    color='condition',
    show=False,
    size=10,
    title="UMAP: MCF7 Drop-seq scaled"
)
Out[157]:
<Axes: title={'center': 'UMAP: MCF7 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>
No description has been provided for this image

HCC DropSeq¶

For this cell line, the interpretability of the plot is less clear compared to MCF7. However, it appears that hypoxic cells are positioned above and below a stripe of cells labeled as normoxic, suggesting some level of separation between the conditions.

In [158]:
# scaled version
sc.tl.umap(
    adata_ds_hcc_scaled,
    min_dist=0.5,
    random_state=42
)
sc.pl.umap(
    adata_ds_hcc_scaled,
    color='condition',
    show=False,
    size=10,
    title="UMAP: HCC1806 Drop-seq scaled"
)
Out[158]:
<Axes: title={'center': 'UMAP: HCC1806 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>
No description has been provided for this image

K-Means Clustering¶

To choose an appropriate number of clusters (k), we applied K-means to the PCA embedding that captures 95 % of the variance. We evaluated both inertia and average silhouette scores for various parameters of k. In most cases the peak value happens to be k = 2, which is what we would expect. In some cases other values looked plausible so we decided to plot them as well. In the end we evaluate clusters comparing them to ground-truth labels using metrics like the ARI, NMI and cluster purity.

When we project those two clusters back into our principal components (PC1 vs. PC2, etc.), and into UMAP or t-SNE space, the resulting partition cleanly separates hypoxic from normoxic cells only in the MCF-7 SmartSeq dataset. In other cell-lines/sampling methods, the clusters overlap substantially and fail to track our desired condition or there is the need of more than 2 clusters to separate hypoxia an normoxia.

This uniquely clear split in the MCF-7 SmartSeq data likely reflects a combination of cell-line consistency under hypoxia and the high sensitivity of the SmartSeq protocol in capturing those changes.

helper functions¶

This cell defines three key helper functions for performing and visualizing K-Means clustering on single-cell data, typically after dimensionality reduction (e.g., PCA):

  1. kmeans_optimization
    Purpose:
    Runs K-Means clustering on a chosen data representation visual or rep_key (e.g., the first N principal components), for a range of cluster numbers (k_range).
    What it does:

    • Fits K-Means for each value of k in the specified range.
    • Computes and stores the inertia (within-cluster sum of squares) and silhouette score (a measure of cluster separation) for each k.
    • Identifies the best k (the one with the highest silhouette score).
    • Stores the clustering results and labels in the AnnData object.
    • Plots the inertia ("elbow plot") and silhouette scores to help visually select the optimal number of clusters.
  2. silhouette_diagrams
    Purpose:
    Visualizes the quality of clustering for different values of k using silhouette plots.
    What it does:

    • For each k, computes the silhouette coefficient for every cell (how well each cell fits within its cluster).
    • Plots the silhouette diagram for each k, showing the distribution of silhouette scores per cluster.
    • Helps assess which k yields the most coherent and well-separated clusters.
  3. plot_kmeans_clusters
    Purpose:
    Visualizes the clustering results in low-dimensional space (UMAP, t-SNE, or PCA).
    What it does:

    • Plots the best K-Means clustering (by passing manually the parameter k, default is the maximizer of the silhouette score) side-by-side with the ground-truth biological condition (hypoxia/normoxia).
    • Supports visualization in UMAP, t-SNE, or PCA space.
    • Allows direct comparison between unsupervised clusters and known biological labels.
  4. evaluate_clustering
    Purpose:
    Evaluates clustering results against ground-truth labels using external metrics and contingency tables.
    What it does:

    • Computes Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) to quantify clustering quality.
    • Generates a contingency table showing the overlap between true labels and cluster assignments.
    • Calculates row-wise and column-wise percentages for the contingency table.
    • Computes purity per cluster and overall purity as additional metrics.
    • Prints all results for easy interpretation.
In [258]:
def kmeans_optimization(
    adata,
    rep_key: str = 'X_pca_95',
    visual: str = 'pca',
    k_range: range = range(2, 10),
    random_state: int = 42
):
    """
    Optimize KMeans on a given embedding, store results in adata.uns and adata.obs.

    Parameters
    ----------
    adata: AnnData
        Must contain adata.obsm[rep_key] for clustering.
    rep_key: str
        Key in adata.obsm to cluster on (e.g. 'X_pca', 'X_pca_95', 'X_umap').
    visual: str
        Identifier under which results will be stored in adata.uns['kmeans'].
    k_range: range
        Range of k values to evaluate (n_clusters).
    random_state: int
        Random seed for reproducibility.

    Effects
    -------
    - Populates adata.uns['kmeans'][visual] = {
          'k_range': list(k_range),
          'inertia': [...],
          'silhouette': [...],
          'best_k': int,
          'best_score': float,
          'labels_key': str
      }
    - Stores best KMeans labels in adata.obs under key provided by 'labels_key'.
    """
    # prepare storage
    if 'kmeans' not in adata.uns:
        adata.uns['kmeans'] = {}
    results = {}

    # extract data for clustering
    X = adata.obsm.get(rep_key)
    if X is None:
        raise KeyError(f"adata.obsm['{rep_key}'] not found")

    inertias = []
    silhouettes = []
    models = []

    # fit and evaluate
    for k in k_range:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        inertias.append(model.inertia_)
        if k > 1:
            silhouettes.append(silhouette_score(X, model.labels_))
        else:
            silhouettes.append(np.nan)
        models.append(model)

    # determine best k by silhouette
    silhouettes_np = np.array(silhouettes)
    # ignore first nan
    best_idx = np.nanargmax(silhouettes_np)
    best_k = k_range[best_idx]
    best_score = silhouettes_np[best_idx]
    best_labels = models[best_idx].labels_.astype(str)
    labels_key = f'kmeans_{visual}'

    # store in adata
    results['k_range'] = list(k_range)
    results['inertia'] = inertias
    results['silhouette'] = silhouettes
    results['best_k'] = int(best_k)
    results['best_score'] = float(best_score)
    results['labels_key'] = labels_key

    adata.uns['kmeans'][visual] = results
    adata.obs[labels_key] = best_labels
    
    # plot inertia & silhouette
    fig, axs = plt.subplots(1, 2, figsize=(12, 4))
    axs[0].plot(list(k_range), inertias, 'bo-')
    axs[0].set_xlabel('k'), axs[0].set_ylabel('Inertia'), axs[0].set_title(f'Elbow Plot ({visual})')
    axs[1].plot(list(k_range), silhouettes, 'bo-')
    axs[1].set_xlabel('k'), axs[1].set_ylabel('Silhouette'), axs[1].set_title(f'Silhouette Scores ({visual})')
    plt.tight_layout()
    plt.show()

    return best_k

def silhouette_diagrams(
    adata,
    rep_key: str = 'X_pca_95',
    visual: str = 'pca',
    k_range: range = range(2, 7),
    dataset_name: str = 'Dataset'
):
    """
    Plot silhouette diagrams for KMeans clusters on a given embedding.

    Parameters
    ----------
    adata: AnnData
        Must contain adata.obsm[rep_key].
    rep_key: str
        Key in adata.obsm for clustering.
    visual: str
        Identifier used for titles and result storage.
    k_range: range
        Values of k to evaluate.
    dataset_name: str
        Name used in subplot titles.

    Effects
    -------
    - Stores silhouette_scores dict in adata.uns['kmeans'][visual]['silhouette_details']
    """
    X = adata.obsm.get(rep_key)
    if X is None:
        raise KeyError(f"adata.obsm['{rep_key}'] not found")

    k_list = list(k_range)
    models = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X) for k in k_list]
    sil_scores = [silhouette_score(X, m.labels_) for m in models]

    # Prepare plots
    n_plots = len(k_list)
    n_cols = 3
    n_rows = math.ceil(n_plots / n_cols)
    plt.figure(figsize=(5 * n_cols, 4 * n_rows))

    details = {}
    for idx, (k, model) in enumerate(zip(k_list, models)):
        ax = plt.subplot(n_rows, n_cols, idx + 1)
        labels = model.labels_
        coeffs = silhouette_samples(X, labels)
        details[k] = coeffs

        padding = len(X) // 30
        pos = padding
        ticks = []
        for cluster in range(k):
            c_vals = np.sort(coeffs[labels == cluster])
            ax.fill_betweenx(
                np.arange(pos, pos + len(c_vals)),
                0, c_vals, alpha=0.7
            )
            ticks.append(pos + len(c_vals)/2)
            pos += len(c_vals) + padding

        ax.yaxis.set_major_locator(FixedLocator(ticks))
        ax.yaxis.set_major_formatter(FixedFormatter(range(k)))
        ax.axvline(x=sil_scores[idx], color='red', linestyle='--')
        ax.set_title(f"{dataset_name} — k={k}")
        if idx % n_cols == 0:
            ax.set_ylabel('Cluster')
        if idx >= (n_rows-1)*n_cols:
            ax.set_xlabel('Silhouette Coefficient')
        else:
            plt.setp(ax.get_xticklabels(), visible=False)

    plt.tight_layout()
    plt.show()

    # store details
    adata.uns['kmeans'][visual]['silhouette_details'] = details
    return dict(zip(k_list, sil_scores))

def plot_kmeans_clusters(
    adata,
    k: int,
    rep_key: str,
    embed: str = 'umap',
    embed_key: str = "",
    pca_dims: tuple = (0, 1),
    random_state: int = 42,
    size: int = 10,
    dataset_name: str = "Dataset",
    consistent_colors: bool = True
):
    """
    Plot KMeans clusters for a single user-specified k value and ground truth conditions side-by-side on UMAP/TSNE or PCA.

    Parameters
    ----------
    adata: AnnData
        Annotated data matrix.
    k: int
        Number of clusters for KMeans.
    rep_key: str
        Key in adata.obsm to cluster on.
    embed: str
        Embedding type ('umap', 'tsne', or 'pca').
    embed_key: str
        Key in adata.obsm for embedding coordinates.
    pca_dims: tuple
        Dimensions to use for PCA plots.
    random_state: int
        Random seed for reproducibility.
    size: int
        Marker size for scatter plots.
    dataset_name: str
        Name of the dataset to include in plot titles.
    consistent_colors: bool
        Whether to use consistent colors across plots for clusters.
    """
    embed_key = embed_key or f'X_{embed}'
    X = adata.obsm.get(rep_key)
    if X is None:
        raise KeyError(f"adata.obsm['{rep_key}'] not found")

    # Run KMeans with user-specified k
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    labels_key = f'kmeans_k{k}'
    adata.obs[labels_key] = model.labels_.astype(str)

    if labels_key not in adata.obs:
        raise KeyError(f"adata.obs['{labels_key}'] not found")

    fig, axes = plt.subplots(1, 2, figsize=(12, 6))

    if embed in ('umap', 'tsne'):
        coords = adata.obsm.get(embed_key)
        if coords is None:
            raise KeyError(f"adata.obsm['{embed_key}'] not found")
        # left: KMeans clusters
        ax = axes[0]
        sc.pl.embedding(
            adata,
            basis=embed,
            color=[labels_key],
            title=f'{dataset_name} — {embed.upper()} KMeans (k={k})',
            show=False,
            ax=ax,
            size=size,
            palette="tab20" if consistent_colors else None,
        )
        # right: true condition
        ax = axes[1]
        sc.pl.embedding(
            adata,
            basis=embed,
            color=['condition'],
            title=f'{dataset_name} — {embed.upper()} True Condition',
            show=False,
            ax=ax,
            size=size,
        )
    elif embed == 'pca':
        pcs = adata.obsm.get(rep_key)
        if pcs is None:
            raise KeyError(f"adata.obsm['{rep_key}'] not found for PCA plot")
        x, y = pca_dims
        # left: KMeans
        ax = axes[0]
        scatter = ax.scatter(
            pcs[:, x], pcs[:, y],
            c=adata.obs[labels_key].astype(int),
            cmap='Set2' if consistent_colors else 'viridis',
            s=size, alpha=0.8
        )
        ax.set_xlabel(f'PC{x+1}'), ax.set_ylabel(f'PC{y+1}')
        ax.set_title(f'{dataset_name} — PCA KMeans (k={k}) — PCs {x+1} vs {y+1}')
        handles, _ = scatter.legend_elements()
        ax.legend(handles, [f'Cluster {i}' for i in range(k)], title='Cluster')
        # right: true condition
        ax = axes[1]
        scatter = ax.scatter(
            pcs[:, x], pcs[:, y],
            c=adata.obs['condition'].astype('category').cat.codes,
            cmap='Set2', s=size, alpha=0.8
        )
        ax.set_xlabel(f'PC{x+1}'), ax.set_ylabel(f'PC{y+1}')
        ax.set_title(f'{dataset_name} — PCA True Condition')
        handles, _ = scatter.legend_elements()
        ax.legend(handles, adata.obs['condition'].cat.categories, title='Condition')
    else:
        raise ValueError("embed must be 'umap', 'tsne', or 'pca'")

    plt.tight_layout()
    plt.show()

def evaluate_clustering(true_labels, cluster_labels, method_name="Clustering"):
    """
    Evaluate clustering against ground-truth labels, including confusion matrix
    with both raw counts and row-/column-wise percentages.

    Parameters
    ----------
    true_labels : array-like
        Ground-truth class labels.
    cluster_labels : array-like
        Cluster assignments.
    method_name : str
        Name of the clustering method for printouts.

    Returns
    -------
    results : dict
        {
            'ARI': float,
            'NMI': float,
            'contingency': pd.DataFrame,
            'row_pct': pd.DataFrame,
            'col_pct': pd.DataFrame,
            'purity_per_cluster': pd.Series,
            'overall_purity': float
        }
    """
    # External metrics
    ari = adjusted_rand_score(true_labels, cluster_labels)
    nmi = normalized_mutual_info_score(true_labels, cluster_labels)
    
    # Contingency table
    ct = pd.crosstab(
        pd.Series(true_labels, name="True"),
        pd.Series(cluster_labels, name="Cluster")
    )

    # Percentages
    row_pct = ct.div(ct.sum(axis=1), axis=0) * 100
    col_pct = ct.div(ct.sum(axis=0), axis=1) * 100

    # Purity calculations
    purity_per_cluster = ct.max(axis=0) / ct.sum(axis=0)
    overall_purity = ct.values.max(axis=1).sum() / ct.values.sum()

    # Print results
    print(f"\n=== {method_name} Evaluation ===")
    print(f"ARI: {ari:.4f}    NMI: {nmi:.4f}\n")
    
    print("Contingency Table (raw counts):")
    print(ct, "\n")
    
    print("Row-wise percentages (each true class → clusters):")
    print(row_pct.round(1).astype(str) + "%", "\n")
    
    print("Column-wise percentages (each cluster ← true classes):")
    print(col_pct.round(1).astype(str) + "%", "\n")
    
    print("Purity per cluster:")
    print(purity_per_cluster.to_frame(name="Purity"), "\n")
    print(f"Overall purity: {overall_purity:.4f}\n")

    # Return everything for further inspection
    return {
        'ARI': ari,
        'NMI': nmi,
        'contingency': ct,
        'row_pct': row_pct,
        'col_pct': col_pct,
        'purity_per_cluster': purity_per_cluster,
        'overall_purity': overall_purity
    }

MCF7 Smart Seq¶

The plots clearly show that the silhouette score peaks at two clusters. The silhouette diagrams further confirm this by displaying well-balanced cluster sizes with minimal negative silhouette values, indicating a strong and consistent clustering structure.
While an elbow might be observed at higher values of k, it is unnecessary to consider them, as the cluster-to-condition comparison for k = 2 across PCA, UMAP, and t-SNE spaces already demonstrates an almost perfect match for both scaled and unscaled version. With k = 2 Kmeans clustering achieved a level of overall purity of 0.9720, which is an almost pefect result.

In [257]:
kmeans_optimization(adata_ss_mcf7, visual="pca_95")
silhouette_diagrams(adata_ss_mcf7, k_range=range(2, 7), dataset_name="Smart-seq MCF7", visual="pca_95")
No description has been provided for this image
No description has been provided for this image
Out[257]:
{2: np.float32(0.49936017),
 3: np.float32(0.4744652),
 4: np.float32(0.46488273),
 5: np.float32(0.4413419),
 6: np.float32(0.42580906)}
In [259]:
plot_kmeans_clusters(
    adata_ss_mcf7,
    k = 2,
    embed='umap',
    rep_key='X_pca_95',
    size=30,
    dataset_name="Smart-seq MCF7"
)

print("=========================")
plot_kmeans_clusters(
    adata_ss_mcf7,
    k = 2,
    embed='tsne',
    rep_key='X_pca_95',
    size=30,
    dataset_name="Smart-seq MCF7"
)
print("=========================")

plot_kmeans_clusters(
    adata_ss_mcf7,
    k = 2,
    embed='pca',
    rep_key='X_pca_95',
    pca_dims=(0, 5),
    size=30,
    dataset_name="Smart-seq MCF7"
)
No description has been provided for this image
=========================
No description has been provided for this image
=========================
No description has been provided for this image
In [261]:
evaluate_clustering(
    adata_ss_mcf7.obs['condition'],
    adata_ss_mcf7.obs['kmeans_k2'],
    method_name="Smart-seq MCF7 KMeans (k=2)"
)
print("\n")
=== Smart-seq MCF7 KMeans (k=2) Evaluation ===
ARI: 0.8907    NMI: 0.8430

Contingency Table (raw counts):
Cluster    0    1
True             
Hypo     117    7
Norm       0  126 

Row-wise percentages (each true class → clusters):
Cluster      0       1
True                  
Hypo     94.4%    5.6%
Norm      0.0%  100.0% 

Column-wise percentages (each cluster ← true classes):
Cluster       0      1
True                  
Hypo     100.0%   5.3%
Norm       0.0%  94.7% 

Purity per cluster:
           Purity
Cluster          
0        1.000000
1        0.947368 

Overall purity: 0.9720



HCC Smart Seq¶

Although the mean silhouette score reaches its maximum at k = 2, the detailed silhouette plots for k = 3–6 show more uniformly high widths across all clusters. This suggests that beyond a simple hypoxic vs. normoxic dichotomy, the HCC data may harbor three or more distinct subgroups—potentially corresponding to different hypoxia responses or other biological states within each condition.
The evaluation of these clusters is very poor for k = 2, with ARI: -0.0042 and NMI: 0.0000. The results improved for k = 3 (ARI: 0.5161, NMI: 0.4844), but the third cluster seems to have captured cells that lie on the border of the two conditions. It might represent biologically cells that are in a transition state and have not fully developed hypoxia yet.

In [262]:
kmeans_optimization(adata_ss_hcc, visual="pca_95")
silhouette_diagrams(adata_ss_hcc, k_range=range(2, 8), dataset_name="Smart-seq HCC1806", visual="pca_95")
No description has been provided for this image
No description has been provided for this image
Out[262]:
{2: np.float32(0.2634671),
 3: np.float32(0.16170451),
 4: np.float32(0.16963938),
 5: np.float32(0.18061545),
 6: np.float32(0.190826),
 7: np.float32(0.17456625)}
In [263]:
plot_kmeans_clusters(
    adata_ss_hcc,
    rep_key='X_pca_95',
    k = 2,
    embed='tsne',
    size=50,
    dataset_name="Smart-seq HCC1806"
)
print("==========================")

plot_kmeans_clusters(
    adata_ss_hcc,
    rep_key='X_pca_95',
    k = 2,
    embed='pca',
    pca_dims=(1, 2),
    size=30,
    dataset_name="Smart-seq HCC1806"
)
No description has been provided for this image
==========================
No description has been provided for this image
In [266]:
evaluate_clustering(
    adata_ss_hcc.obs['condition'],
    adata_ss_hcc.obs['kmeans_k2'],
    method_name="KMeans Clustering (HCC) k=2"
)
print("\n")
=== KMeans Clustering (HCC) k=2 Evaluation ===
ARI: -0.0042    NMI: 0.0000

Contingency Table (raw counts):
Cluster   0   1
True           
Hypo     28  69
Norm     25  60 

Row-wise percentages (each true class → clusters):
Cluster      0      1
True                 
Hypo     28.9%  71.1%
Norm     29.4%  70.6% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1
True                 
Hypo     52.8%  53.5%
Norm     47.2%  46.5% 

Purity per cluster:
           Purity
Cluster          
0        0.528302
1        0.534884 

Overall purity: 0.7088




In [267]:
plot_kmeans_clusters(
    adata_ss_hcc,
    rep_key='X_pca_95',
    k = 3,
    embed='umap',
    size=50,
    dataset_name="Smart-seq HCC1806"
)
print("==========================")
plot_kmeans_clusters(
    adata_ss_hcc,
    rep_key='X_pca_95',
    k = 3,
    embed='pca',
    pca_dims=(1, 2),
    size=30,
    dataset_name="Smart-seq HCC1806"
)
No description has been provided for this image
==========================
No description has been provided for this image
In [268]:
evaluate_clustering(
    adata_ss_hcc.obs['condition'],
    adata_ss_hcc.obs['kmeans_k3'],
    method_name="KMeans Clustering (HCC) k =3"
)
print("\n")
=== KMeans Clustering (HCC) k =3 Evaluation ===
ARI: 0.5161    NMI: 0.4844

Contingency Table (raw counts):
Cluster   0   1   2
True               
Hypo     18   7  72
Norm     19  66   0 

Row-wise percentages (each true class → clusters):
Cluster      0      1      2
True                        
Hypo     18.6%   7.2%  74.2%
Norm     22.4%  77.6%   0.0% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1       2
True                         
Hypo     48.6%   9.6%  100.0%
Norm     51.4%  90.4%    0.0% 

Purity per cluster:
           Purity
Cluster          
0        0.513514
1        0.904110
2        1.000000 

Overall purity: 0.7582



MCF7 Drop Seq¶

The optimal number of clusters, based on the silhouette score optimization, is k = 3. However, the resulting clusters do not align well with the desired biological conditions. The evaluation metrics are ARI: 0.3221 and NMI: 0.3356. Inspecting the contingency tables reveals that cluster 1 exhibits the most overlap, while the other clusters perform better in terms of purity. This observation is further supported by the PCA plot, which highlights the evident discrepancies. This again, could be caused by the fact that some cells are in a transition state between the two conditions.

In [168]:
kmeans_optimization(adata_ds_mcf7_scaled, rep_key="X_pca")
silhouette_diagrams(adata_ds_mcf7_scaled, k_range=range(2, 7), dataset_name="Drop-seq MCF7", rep_key="X_pca")
No description has been provided for this image
No description has been provided for this image
Out[168]:
{2: np.float32(0.24377671),
 3: np.float32(0.26212752),
 4: np.float32(0.24349803),
 5: np.float32(0.18707642),
 6: np.float32(0.18642579)}
In [250]:
# scaled version
plot_kmeans_clusters(
    adata_ds_mcf7_scaled,
    rep_key='X_pca',
    embed='tsne',
    size=10,
    k=3,
    dataset_name = "Drop-seq MCF7"
)
print("==========================")

plot_kmeans_clusters(
    adata_ds_mcf7_scaled,
    embed='umap',
    size=10,
    rep_key='X_pca',
    k=3,
    dataset_name="Drop-seq MCF7"
)
print("==========================")

plot_kmeans_clusters(
    adata_ds_mcf7_scaled,
    embed='pca',
    rep_key='X_pca',
    pca_dims=(0,2),
    size=10,
    k=3,
    dataset_name="Drop-seq MCF7"
)
No description has been provided for this image
==========================
No description has been provided for this image
==========================
No description has been provided for this image
In [170]:
evaluate_clustering(
    adata_ds_mcf7_scaled.obs['condition'],
    adata_ds_mcf7_scaled.obs['kmeans_k3'],
    method_name="Drop-seq MCF7 KMeans (k=3)"
)
print("\n")
=== Drop-seq MCF7 KMeans (k=3) Evaluation ===
ARI: 0.3221    NMI: 0.3356

Contingency Table (raw counts):
Cluster     0     1     2
True                     
Hypo     5753  2603   565
Norm      160  9065  3480 

Row-wise percentages (each true class → clusters):
Cluster      0      1      2
True                        
Hypo     64.5%  29.2%   6.3%
Norm      1.3%  71.3%  27.4% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1      2
True                        
Hypo     97.3%  22.3%  14.0%
Norm      2.7%  77.7%  86.0% 

Purity per cluster:
           Purity
Cluster          
0        0.972941
1        0.776911
2        0.860321 

Overall purity: 0.6852



HCC Drop Seq¶

The silhouette score peaks at k=2, and the silhouette diagram for k=2 indicates well-separated and balanced clusters. However, the resulting clusters do not correspond to the expected biological conditions when visualized across different spaces.

In [171]:
kmeans_optimization(adata_ds_hcc_scaled, rep_key="X_pca")
silhouette_diagrams(adata_ds_hcc_scaled, k_range=range(2, 8), dataset_name="Drop-seq HCC1806 scaled", rep_key="X_pca")
No description has been provided for this image
No description has been provided for this image
Out[171]:
{2: np.float32(0.2080853),
 3: np.float32(0.17907955),
 4: np.float32(0.17723405),
 5: np.float32(0.1760093),
 6: np.float32(0.17995057),
 7: np.float32(0.16560963)}
In [251]:
#scaled version
plot_kmeans_clusters(
    adata_ds_hcc_scaled,
    rep_key='X_pca',
    embed='umap',
    size=10,
    k=2,
    dataset_name="Drop-seq HCC1806"
)
plot_kmeans_clusters(
    adata_ds_hcc_scaled,
    embed='tsne',
    size=10,
    rep_key='X_pca',
    k=2,
    dataset_name="Drop-seq HCC1806"
)
plot_kmeans_clusters(
    adata_ds_hcc_scaled,
    embed='pca',
    rep_key='X_pca',
    pca_dims=(2,3),
    size=10,
    k=2,
    dataset_name="Drop-seq HCC1806"
)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Hierarchical Clustering¶

Hierarchical clustering is an alternative to K-Means that builds a tree-like structure of nested clusters.

There are two main types of hierarchical clustering:

  • Agglomerative Clustering: A "bottom-up" approach where each data point starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive Clustering: A "top-down" approach where all data points start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Agglomerative clustering is more commonly used and is implemented in libraries like scikit-learn. It allows for different linkage criteria, such as:

  • Single Linkage: Minimum distance between points in two clusters.
  • Complete Linkage: Maximum distance between points in two clusters.
  • Average Linkage: Average distance between all points in two clusters.
  • Ward's Linkage: Minimizes the variance within clusters.

For the purpose of our analysis, we will use only agglomerative clustering, using Ward's linkage. This method is suitable for our dataset as it minimizes the variance within clusters, ensuring that the resulting clusters are compact and well-separated.

helper functions¶

In [231]:
    # ------------------------------------------    
    # Step 1: Use PCA-reduced data from Scanpy
    # ------------------------------------------
def plot_dendrogram(adata, use_rep='X_pca_95', title="title"):
    # Check if PCA has been computed
    X = adata.obsm[use_rep]  # MCF7 SmartSeq PCA coords

    # ------------------------------------------
    # Step 2: Plot dendrogram to visualize hierarchy
    # ------------------------------------------
    linked = linkage(X, method='ward')

    plt.figure(figsize=(12, 6))
    dendrogram(
        linked,
        orientation='top',
        distance_sort='descending',
        show_leaf_counts=False,
        truncate_mode='level',
        p=30
    )
    plt.title(f'Hierarchical Clustering Dendrogram for {title}')
    plt.xlabel('Cells (truncated)')
    plt.ylabel('Distance')
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    # ------------------------------------------
    # Step 3: Run Agglomerative Clustering
    # ------------------------------------------

def run_agglo(adata, cut, components="1,2", use_rep='X_pca_95', title=""):
    # Check if PCA has been computed
    X = adata.obsm[use_rep]  # MCF7 SmartSeq PCA coords

    # Perform Agglomerative Clustering
    agglo = AgglomerativeClustering(n_clusters=cut, linkage='ward')
    hc_labels = agglo.fit_predict(X)

    adata.obs['hc_clusters'] = hc_labels.astype(str)

    # ------------------------------------------
    # Step 4: Plot PCA and UMAP colored by cluster vs. condition
    # ------------------------------------------
    fig, axes = plt.subplots(2, 2, figsize=(16, 14))

    # Top row: PCA
    sc.pl.pca(
        adata,
        color='hc_clusters',
        ax=axes[0, 0],
        show=False,
        size=50,
        components=components
    )
    axes[0, 0].set_title(f"PCA: Agglomerative Clusters for {title}, {cut} clusters")

    sc.pl.pca(
        adata,
        color='condition',
        ax=axes[0, 1],
        show=False,
        size=50,
        components=components
    )
    axes[0, 1].set_title(f"PCA: Original Condition for {title}, {cut} clusters")
    # Bottom row: UMAP
    sc.pl.umap(
        adata,
        color='hc_clusters',
        ax=axes[1, 0],
        show=False,
        size=50
    )
    axes[1, 0].set_title(f'UMAP: Agglomerative Clusters for {title}, {cut} clusters')

    sc.pl.umap(
        adata,
        color='condition',
        ax=axes[1, 1],
        show=False,
        size=50
    )
    axes[1, 1].set_title(f'UMAP: Original Condition for {title}, {cut} clusters')

    plt.tight_layout()
    plt.show()

MCF7 Smart Seq¶

Using the agglomerative clustering technique, the clusters identified for this dataset are highly consistent. The dendrogram reveals a clear separation with two distinct branches, which correspond accurately to the biological conditions. The overall purity achieved is 0.9840, surpassing the results obtained with K-Means clustering.

In [232]:
plot_dendrogram(adata_ss_mcf7, use_rep='X_pca_95', title="MCF7 SmartSeq")
run_agglo(adata_ss_mcf7, cut=2, components="1,2", use_rep='X_pca_95', title="MCF7 SmartSeq")
No description has been provided for this image
No description has been provided for this image
In [233]:
evaluate_clustering(
    adata_ss_mcf7.obs['condition'],
    adata_ss_mcf7.obs['hc_clusters'],
    method_name="Smart-seq HCC1806 Agglomerative Clustering (k=2)"
)
print("\n")
=== Smart-seq HCC1806 Agglomerative Clustering (k=2) Evaluation ===
ARI: 0.9368    NMI: 0.8974

Contingency Table (raw counts):
Cluster    0    1
True             
Hypo     124    0
Norm       4  122 

Row-wise percentages (each true class → clusters):
Cluster       0      1
True                  
Hypo     100.0%   0.0%
Norm       3.2%  96.8% 

Column-wise percentages (each cluster ← true classes):
Cluster      0       1
True                  
Hypo     96.9%    0.0%
Norm      3.1%  100.0% 

Purity per cluster:
          Purity
Cluster         
0        0.96875
1        1.00000 

Overall purity: 0.9840



HCC Smart Seq¶

Consistent with the findings from K-Means clustering, it is challenging to split the HCC SmartSeq dataset into two distinct clusters. In this cell, we plot the dendrogram and observe that the largest jump suggests two clusters. However, one could also argue that six clusters might be a reasonable choice, so we inspect both scenarios.

For two clusters, the results are suboptimal, with low ARI and NMI scores. When using six clusters, both ARI and NMI scores improve (ARI: 0.3229, NMI: 0.3799), but the values remain relatively low. This is expected, as these metrics tend to perform worse when the number of clusters exceeds the number of biological conditions. The six-cluster solution may capture additional subgroups, potentially representing cells in transitional states or other biological variations within the dataset.

In [273]:
plot_dendrogram(adata_ss_hcc, use_rep='X_pca_95', title="HCC1806 SmartSeq")
run_agglo(adata_ss_hcc, cut=2, components='2,3', use_rep='X_pca_95', title="HCC1806 SmartSeq")
No description has been provided for this image
No description has been provided for this image
In [275]:
evaluate_clustering(
    adata_ss_hcc.obs['condition'],
    adata_ss_hcc.obs['hc_clusters'],
    method_name="Smart-seq HCC1806 Agglomerative Clustering (k=2)"
)
print("\n")
=== Smart-seq HCC1806 Agglomerative Clustering (k=2) Evaluation ===
ARI: 0.0257    NMI: 0.0201

Contingency Table (raw counts):
Cluster   0   1
True           
Hypo     27  70
Norm     37  48 

Row-wise percentages (each true class → clusters):
Cluster      0      1
True                 
Hypo     27.8%  72.2%
Norm     43.5%  56.5% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1
True                 
Hypo     42.2%  59.3%
Norm     57.8%  40.7% 

Purity per cluster:
           Purity
Cluster          
0        0.578125
1        0.593220 

Overall purity: 0.6484



In [236]:
run_agglo(adata_ss_hcc, cut=6, components='2,3', use_rep='X_pca_95', title="HCC1806 SmartSeq")
No description has been provided for this image
In [237]:
evaluate_clustering(
    adata_ss_hcc.obs['condition'],
    adata_ss_hcc.obs['hc_clusters'],
    method_name="Smart-seq HCC1806 Agglomerative Clustering (k=6)"
)
print("\n")
=== Smart-seq HCC1806 Agglomerative Clustering (k=6) Evaluation ===
ARI: 0.3229    NMI: 0.3799

Contingency Table (raw counts):
Cluster   0  1  2   3   4  5
True                        
Hypo     25  1  8   4  58  1
Norm     36  1  0  48   0  0 

Row-wise percentages (each true class → clusters):
Cluster      0     1     2      3      4     5
True                                          
Hypo     25.8%  1.0%  8.2%   4.1%  59.8%  1.0%
Norm     42.4%  1.2%  0.0%  56.5%   0.0%  0.0% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1       2      3       4       5
True                                                
Hypo     41.0%  50.0%  100.0%   7.7%  100.0%  100.0%
Norm     59.0%  50.0%    0.0%  92.3%    0.0%    0.0% 

Purity per cluster:
           Purity
Cluster          
0        0.590164
1        0.500000
2        1.000000
3        0.923077
4        1.000000
5        1.000000 

Overall purity: 0.5824



MCF7 DropSeq¶

Upon inspecting the dendrograms, two and three clusters emerge as reasonable choices for identifying distinct groups. For two clusters, all evaluation metrics are superior. Notably, in the UMAP plot, the two cluster "1" contains most of the values labelled as normoxia, but it also contains a decent chunk of values of the wrong condition.

Adding a third cluster does not improve the metrics and instead results in a decline in performance for ARI and NMI scores. The additional cluster appears in a region already well-defined by the initial two clusters, offering no new insights and reducing the overall performance by both ARI and NMI.

In [238]:
plot_dendrogram(adata_ds_mcf7_scaled, use_rep='X_pca', title="MCF7 DropSeq")
No description has been provided for this image
In [239]:
run_agglo(adata_ds_mcf7_scaled, cut=2, components='1,3', use_rep='X_pca', title="MCF7 DropSeq")
No description has been provided for this image
In [240]:
evaluate_clustering(
    adata_ds_mcf7_scaled.obs['condition'],
    adata_ds_mcf7_scaled.obs['hc_clusters'],
    method_name="Drop-seq MCF7 Agglomerative Clustering (k=2)"
)
print("\n")
=== Drop-seq MCF7 Agglomerative Clustering (k=2) Evaluation ===
ARI: 0.5094    NMI: 0.4794

Contingency Table (raw counts):
Cluster      0     1
True                
Hypo      2984  5937
Norm     12616    89 

Row-wise percentages (each true class → clusters):
Cluster      0      1
True                 
Hypo     33.4%  66.6%
Norm     99.3%   0.7% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1
True                 
Hypo     19.1%  98.5%
Norm     80.9%   1.5% 

Purity per cluster:
           Purity
Cluster          
0        0.808718
1        0.985231 

Overall purity: 0.8579




In [241]:
run_agglo(adata_ds_mcf7_scaled, cut=3, components='1,3', use_rep='X_pca' , title="MCF7 DropSeq")
No description has been provided for this image
In [242]:
evaluate_clustering(
    adata_ds_mcf7_scaled.obs['condition'],
    adata_ds_mcf7_scaled.obs['hc_clusters'],
    method_name="Drop-seq MCF7 Agglomerative Clustering (k=3)"
)
print("\n")
=== Drop-seq MCF7 Agglomerative Clustering (k=3) Evaluation ===
ARI: 0.3801    NMI: 0.4052

Contingency Table (raw counts):
Cluster      0     1     2
True                      
Hypo      2906  5937    78
Norm     10479    89  2137 

Row-wise percentages (each true class → clusters):
Cluster      0      1      2
True                        
Hypo     32.6%  66.6%   0.9%
Norm     82.5%   0.7%  16.8% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1      2
True                        
Hypo     21.7%  98.5%   3.5%
Norm     78.3%   1.5%  96.5% 

Purity per cluster:
           Purity
Cluster          
0        0.782891
1        0.985231
2        0.964786 

Overall purity: 0.7591



HCC DropSeq¶

For this cell line, the dendrograms suggest reasonable cluster numbers of 2 and 4. We evaluate both scenarios. While the ARI is low, the 4-cluster solution shows interesting results, as it appears to represent subgroups predicting hypoxia or normoxia. In contrast, the 2-cluster solution lacks a clear biological interpretation in the context of our analysis.

In [243]:
plot_dendrogram(adata_ds_hcc_scaled, use_rep="X_pca", title="HCC1806 DropSeq")
No description has been provided for this image
In [244]:
run_agglo(adata_ds_hcc_scaled, cut=2, components="3,4", use_rep='X_pca', title="HCC1806 DropSeq")
No description has been provided for this image
In [245]:
evaluate_clustering(
    adata_ds_hcc_scaled.obs['condition'],
    adata_ds_hcc_scaled.obs['hc_clusters'],
    method_name="Drop-seq HCC1806 Agglomerative Clustering (k=2)"
)
print("\n")
=== Drop-seq HCC1806 Agglomerative Clustering (k=2) Evaluation ===
ARI: 0.0640    NMI: 0.0364

Contingency Table (raw counts):
Cluster     0     1
True               
Hypo     6186  2713
Norm     2741  3042 

Row-wise percentages (each true class → clusters):
Cluster      0      1
True                 
Hypo     69.5%  30.5%
Norm     47.4%  52.6% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1
True                 
Hypo     69.3%  47.1%
Norm     30.7%  52.9% 

Purity per cluster:
           Purity
Cluster          
0        0.692954
1        0.528584 

Overall purity: 0.6285



In [246]:
run_agglo(adata_ds_hcc_scaled, cut=4, components="3,4", use_rep='X_pca', title="HCC1806 DropSeq")
No description has been provided for this image
In [247]:
evaluate_clustering(
    adata_ds_hcc_scaled.obs['condition'],
    adata_ds_hcc_scaled.obs['hc_clusters'],
    method_name="Drop-seq HCC1806 Agglomerative Clustering (k=4)"
)
print("\n")
=== Drop-seq HCC1806 Agglomerative Clustering (k=4) Evaluation ===
ARI: 0.2156    NMI: 0.2104

Contingency Table (raw counts):
Cluster     0     1     2     3
True                           
Hypo      998  2216  5188   497
Norm     1634   210  1107  2832 

Row-wise percentages (each true class → clusters):
Cluster      0      1      2      3
True                               
Hypo     11.2%  24.9%  58.3%   5.6%
Norm     28.3%   3.6%  19.1%  49.0% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1      2      3
True                               
Hypo     37.9%  91.3%  82.4%  14.9%
Norm     62.1%   8.7%  17.6%  85.1% 

Purity per cluster:
           Purity
Cluster          
0        0.620821
1        0.913438
2        0.824146
3        0.850706 

Overall purity: 0.5462



Leiden clustering (Scanpy)¶

Leiden clustering is a community detection algorithm commonly used in single-cell analysis to identify clusters of similar cells. It is an improvement over the Louvain algorithm, offering better partition quality and robustness.

In Scanpy, the sc.tl.leiden() function is used to perform Leiden clustering. It requires a k-NN graph, which was computed running sc.pp.neighbors() in the KNN section. Paramters:

  • resolution: Controls the granularity of the clustering. Higher values result in more clusters. In our analysis, the value of resolution was selected in such a way that the number of cluster was minimal but still they were approximately evenly distributed

Leiden clustering is particularly effective for identifying subpopulations in high-dimensional single-cell datasets, making it a powerful tool for exploratory data analysis.

MCF7 Smart Seq¶

As we expected the Leiden clustering results for the MCF7 Smart Seq dataset reveal a clear separation into two clusters, corresponding to hypoxic and normoxic conditions. This is consistent both for scaled and uscaled version. Overall purity is : 0.9920, the best so far for this dataset.

In [192]:
# Using the igraph implementation and a fixed number of iterations can be significantly faster, especially for larger datasets
sc.tl.leiden(adata_ss_mcf7, flavor="igraph", n_iterations=2, random_state=42, resolution=0.1)
sc.pl.umap(
    adata_ss_mcf7,
    color=['leiden', 'condition'],
    show=False,
    size=40,
    title=["Leiden Clustering: MCF7 Smart-seq", "Condition for MCF7 Smart-seq"]
)
Out[192]:
[<Axes: title={'center': 'Leiden Clustering: MCF7 Smart-seq'}, xlabel='UMAP1', ylabel='UMAP2'>,
 <Axes: title={'center': 'Condition for MCF7 Smart-seq'}, xlabel='UMAP1', ylabel='UMAP2'>]
No description has been provided for this image
In [193]:
#plot also in the pca space
sc.pl.pca(
    adata_ss_mcf7,
    color=['leiden', 'condition'],
    show=False,
    size=40,
    components="1,6",
    title=["Leiden Clustering: MCF7 Drop-seq scaled", "Condition for MCF7 Drop-seq scaled"]
)
Out[193]:
[<Axes: title={'center': 'Leiden Clustering: MCF7 Drop-seq scaled'}, xlabel='PC1', ylabel='PC6'>,
 <Axes: title={'center': 'Condition for MCF7 Drop-seq scaled'}, xlabel='PC1', ylabel='PC6'>]
No description has been provided for this image
In [194]:
evaluate_clustering(
    adata_ss_mcf7.obs['condition'],
    adata_ss_mcf7.obs['leiden'],
    method_name="Smart-seq MCF7 Agglomerative Clustering"
)
print("\n")
=== Smart-seq MCF7 Agglomerative Clustering Evaluation ===
ARI: 0.9681    NMI: 0.9407

Contingency Table (raw counts):
Cluster    0    1
True             
Hypo       2  122
Norm     126    0 

Row-wise percentages (each true class → clusters):
Cluster       0      1
True                  
Hypo       1.6%  98.4%
Norm     100.0%   0.0% 

Column-wise percentages (each cluster ← true classes):
Cluster      0       1
True                  
Hypo      1.6%  100.0%
Norm     98.4%    0.0% 

Purity per cluster:
           Purity
Cluster          
0        0.984375
1        1.000000 

Overall purity: 0.9920



HCC Smart Seq¶

Leiden clustering provides better-defined clusters compared to KMeans and Agglomerative, with fewer misclassifications. However, some overlap is still observed at the cluster boundaries, potentially reflecting transitional states in cells where hypoxia was developing but not fully expressed. In any case, this is the best result we got from clustering for this cell line with overall purity of 0.9505.

In [195]:
sc.tl.leiden(adata_ss_hcc, flavor="igraph", n_iterations=2, random_state=42, resolution=0.3)
sc.pl.umap(
    adata_ss_hcc,
    color=['leiden', 'condition'],
    show=False,
    size=40,
    title=["Leiden Clustering: HCC1806 Smart-seq", "Condition for HCC1806 Smart-seq"]
)
Out[195]:
[<Axes: title={'center': 'Leiden Clustering: HCC1806 Smart-seq'}, xlabel='UMAP1', ylabel='UMAP2'>,
 <Axes: title={'center': 'Condition for HCC1806 Smart-seq'}, xlabel='UMAP1', ylabel='UMAP2'>]
No description has been provided for this image
In [196]:
#plot also in the pca space
sc.pl.pca(
    adata_ss_hcc,
    color=['leiden', 'condition'],
    show=False,
    size=40,
    components="2,3",
    title=["Leiden Clustering: HCC Drop-seq scaled", "Condition for HCC Drop-seq scaled"]
)
Out[196]:
[<Axes: title={'center': 'Leiden Clustering: HCC Drop-seq scaled'}, xlabel='PC2', ylabel='PC3'>,
 <Axes: title={'center': 'Condition for HCC Drop-seq scaled'}, xlabel='PC2', ylabel='PC3'>]
No description has been provided for this image
In [197]:
evaluate_clustering(
    adata_ss_hcc.obs['condition'],
    adata_ss_hcc.obs['leiden'],
    method_name="Smart-seq HCC1806"
)
print("\n")
=== Smart-seq HCC1806 Evaluation ===
ARI: 0.8109    NMI: 0.7390

Contingency Table (raw counts):
Cluster   0   1
True           
Hypo      8  89
Norm     84   1 

Row-wise percentages (each true class → clusters):
Cluster      0      1
True                 
Hypo      8.2%  91.8%
Norm     98.8%   1.2% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1
True                 
Hypo      8.7%  98.9%
Norm     91.3%   1.1% 

Purity per cluster:
           Purity
Cluster          
0        0.913043
1        0.988889 

Overall purity: 0.9505



MCF7 Drop Seq¶

Using a lower resolution parameter (0.1) we observe that leiden clusters align almost pefectly with conditions, reaching an overall purity of 0.9767, the best one for this dataset so far.

In [198]:
# scaled version
sc.tl.leiden(adata_ds_mcf7_scaled, flavor="igraph", n_iterations=2, random_state=42, resolution=0.1)
sc.pl.umap(
    adata_ds_mcf7_scaled,
    color=['leiden', 'condition'],
    show=False,
    size=6,
    title=["Leiden Clustering: MCF7 Drop-seq scaled", "Condition for MCF7 Drop-seq scaled"]
)
Out[198]:
[<Axes: title={'center': 'Leiden Clustering: MCF7 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>,
 <Axes: title={'center': 'Condition for MCF7 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>]
No description has been provided for this image
In [199]:
#plot also in the pca space
sc.pl.pca(
    adata_ds_mcf7_scaled,
    color=['leiden', 'condition'],
    show=False,
    size=20,
    components="1,3",
    title=["Leiden Clustering: MCF7 Drop-seq scaled", "Condition for MCF7 Drop-seq scaled"]
)
Out[199]:
[<Axes: title={'center': 'Leiden Clustering: MCF7 Drop-seq scaled'}, xlabel='PC1', ylabel='PC3'>,
 <Axes: title={'center': 'Condition for MCF7 Drop-seq scaled'}, xlabel='PC1', ylabel='PC3'>]
No description has been provided for this image
In [276]:
evaluate_clustering(
    adata_ds_mcf7_scaled.obs['condition'],
    adata_ds_mcf7_scaled.obs['leiden'],
    method_name=" Drop-seq MCF7 Leiden Clustering"
)
print("\n")
===  Drop-seq MCF7 Leiden Clustering Evaluation ===
ARI: 0.9090    NMI: 0.8449

Contingency Table (raw counts):
Cluster      0     1
True                
Hypo       417  8504
Norm     12619    86 

Row-wise percentages (each true class → clusters):
Cluster      0      1
True                 
Hypo      4.7%  95.3%
Norm     99.3%   0.7% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1
True                 
Hypo      3.2%  99.0%
Norm     96.8%   1.0% 

Purity per cluster:
           Purity
Cluster          
0        0.968012
1        0.989988 

Overall purity: 0.9767



HCC Drop Seq¶

Also in this case, using a lower resolution (0.16) leads to almost perfect clusters, with overall purity of 0.9409.

In [201]:
sc.tl.leiden(adata_ds_hcc_scaled, flavor="igraph", n_iterations=2, random_state=42, resolution=0.16)
sc.pl.umap(
    adata_ds_hcc_scaled,
    color=['leiden', 'condition'],
    show=False,
    size=20,
    title=["Leiden Clustering: HCC1806 Drop-seq scaled", "Condition for HCC1806 Drop-seq scaled"]
)
Out[201]:
[<Axes: title={'center': 'Leiden Clustering: HCC1806 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>,
 <Axes: title={'center': 'Condition for HCC1806 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>]
No description has been provided for this image
In [202]:
#plot also in the pca space
sc.pl.pca(
    adata_ds_hcc_scaled,
    color=['leiden', 'condition'],
    show=False,
    size=20,
    components="3,4",
    title=["Leiden Clustering: HCC Drop-seq scaled", "Condition for HCC Drop-seq scaled"]
)
Out[202]:
[<Axes: title={'center': 'Leiden Clustering: HCC Drop-seq scaled'}, xlabel='PC3', ylabel='PC4'>,
 <Axes: title={'center': 'Condition for HCC Drop-seq scaled'}, xlabel='PC3', ylabel='PC4'>]
No description has been provided for this image
In [203]:
evaluate_clustering(
    adata_ds_hcc_scaled.obs['condition'],
    adata_ds_hcc_scaled.obs['leiden'],
    method_name="Smart-seq HCC1806 Leiden Clustering"
)
print("\n")
=== Smart-seq HCC1806 Leiden Clustering Evaluation ===
ARI: 0.7775    NMI: 0.6787

Contingency Table (raw counts):
Cluster     0     1
True               
Hypo      634  8265
Norm     5550   233 

Row-wise percentages (each true class → clusters):
Cluster      0      1
True                 
Hypo      7.1%  92.9%
Norm     96.0%   4.0% 

Column-wise percentages (each cluster ← true classes):
Cluster      0      1
True                 
Hypo     10.3%  97.3%
Norm     89.7%   2.7% 

Purity per cluster:
           Purity
Cluster          
0        0.897477
1        0.972582 

Overall purity: 0.9409



Conclusions¶

  1. Clustering Performance:

    • The Leiden clustering algorithm consistently outperformed K-Means and Agglomerative clustering across all datasets, achieving the highest overall purity scores.
      • MCF7 Smart Seq: Leiden clustering achieved an overall purity of 0.9920, the best result for this dataset.
      • HCC Smart Seq: Leiden clustering achieved an overall purity of 0.9505, significantly better than other methods.
      • MCF7 Drop Seq: Leiden clustering achieved an overall purity of 0.9767, outperforming other clustering techniques.
      • HCC Drop Seq: Leiden clustering achieved an overall purity of 0.9409, demonstrating its robustness.
  2. Biological Insights:

    • For MCF7 Smart Seq, the clusters identified by all the techniques aligned almost perfectly with the biological conditions (hypoxia vs. normoxia), indicating a clear separation between the two states.
    • For HCC Smart Seq, while Leiden clustering provided better-defined clusters, some overlap at the boundaries suggests the presence of transitional states in cells.
    • For MCF7 Drop Seq, the clusters identified by Leiden clustering were highly consistent with the biological conditions, highlighting its effectiveness in scaled datasets.
    • For HCC Drop Seq, the clusters revealed by Leiden clustering were well-separated, but some subgroups found by Agglomerativev Clustering may represent additional biological variations.
  3. Limitations:

    • K-Means and Agglomerative clustering struggled to capture the underlying structure of the data, often resulting in lower ARI and NMI scores compared to Leiden clustering.
    • The ARI and NMI metrics tend to perform worse in scenarios where there are more clusters than conditions.
  4. Recommendations:

    • For future analyses, Leiden clustering is recommended as the primary clustering method due to its superior performance and ability to handle complex single-cell datasets.
    • Further investigation into transitional states and subgroups in datasets could provide deeper biological insights.
    • Parameter tuning, such as adjusting the resolution in Leiden clustering, can further optimize clustering results for specific datasets.

Supervised Learning: Hypoxia vs Normoxia¶

The search for a classifier involves the use of logistic regression, SVM, random forest, and multilayer perceptron models. This diversity in models allows for a more robust classifier.

These individual models are initially trained on PCA-transformed data, since the full data sets are very high-dimensional and require significant processing power. Then, feature selection for genes in the original (not PCA-transformed) data is done as well as feature selection of the PCA-transformed data to identify the top principal components.

Finally, a simple ensemble model takes the majority votes of the models trained on the selected genes for each data set. A larger generalized ensemble model then takes the ensemble of these four simples ensemble models, each trained on a different data set, and takes the majority vote of their predictions.

Preparation¶

Data¶

Earlier we extracted the PCA-transformed features using the number of components required to explain 95% of the variance in each dataset. We define X_pca_ss_mcf7 and y_pca_ss_mcf7, where X_pca_ss_mcf7 contains the reduced feature representation of each cell (principal components), and y_pca_ss_mcf7 contains the corresponding condition labels (“Hypoxia” or “Normoxia”) for each cell. These will be used as input features and target labels, respectively, for training supervised classification models.

In [ ]:
X_pca_ss_mcf7 = adata_ss_mcf7.obsm['X_pca_95']
X_pca_ss_hcc = adata_ss_hcc.obsm['X_pca_95']
X_pca_ds_mcf7 = adata_ds_mcf7.obsm['X_pca_95']
X_pca_ds_hcc = adata_ds_hcc.obsm['X_pca_95']

y_pca_ss_mcf7 = adata_ss_mcf7.obs['condition'].values
y_pca_ss_hcc = adata_ss_hcc.obs['condition'].values
y_pca_ds_mcf7 = adata_ds_mcf7.obs['condition'].values
y_pca_ds_hcc = adata_ds_hcc.obs['condition'].values
In [ ]:
print(
    f"SmartSeq MCF7 X_pca shape: {X_pca_ss_mcf7.shape}, y_pca shape: {y_pca_ss_mcf7.shape}",
    f"SmartSeq HCC1806 X_pca shape: {X_pca_ss_hcc.shape}, y_pca shape: {y_pca_ss_hcc.shape}",
    f"DropSeq MCF7 X_pca shape: {X_pca_ds_mcf7.shape}, y_pca shape: {y_pca_ds_mcf7.shape}",
    f"DropSeq HCC1806 X_pca shape: {X_pca_ds_hcc.shape}, y_pca shape: {y_pca_ds_hcc.shape}",
    sep = "\n"
)
SmartSeq MCF7 X_pca shape: (250, 20), y_pca shape: (250,)
SmartSeq HCC1806 X_pca shape: (182, 34), y_pca shape: (182,)
DropSeq MCF7 X_pca shape: (21626, 761), y_pca shape: (21626,)
DropSeq HCC1806 X_pca shape: (14682, 844), y_pca shape: (14682,)
In [ ]:
encoder = LabelEncoder()
y_pca_ss_mcf7_encoded = encoder.fit_transform(y_pca_ss_mcf7)
y_pca_ss_hcc_encoded = encoder.fit_transform(y_pca_ss_hcc)
y_pca_ds_mcf7_encoded = encoder.fit_transform(y_pca_ds_mcf7)
y_pca_ds_hcc_encoded = encoder.fit_transform(y_pca_ds_hcc)

print("Label classes:", encoder.classes_)
print("Internal encoding:", encoder.transform(encoder.classes_))
Label classes: ['Hypo' 'Norm']
Internal encoding: [0 1]

In our case:

  • 0 = 'Hypo'
  • 1 = 'Norm'
In [ ]:
print("SmartSeq MCF7:", np.unique(y_pca_ss_mcf7, return_counts = True))
print("SmartSeq HCC:", np.unique(y_pca_ss_hcc, return_counts = True))
print("DropSeq MCF7:", np.unique(y_pca_ds_mcf7, return_counts = True))
print("DropSeq HCC:", np.unique(y_pca_ds_hcc, return_counts = True))
SmartSeq MCF7: (array(['Hypo', 'Norm'], dtype=object), array([124, 126]))
SmartSeq HCC: (array(['Hypo', 'Norm'], dtype=object), array([97, 85]))
DropSeq MCF7: (array(['Hypo', 'Norm'], dtype=object), array([ 8921, 12705]))
DropSeq HCC: (array(['Hypo', 'Norm'], dtype=object), array([8899, 5783]))

Cross-validation functions¶

In [ ]:
def summarize_crossvalidation(search: GridSearchCV | RandomizedSearchCV):
    """Summarize model data for cross-validation."""
    best_model = search.best_estimator_
    
    print("Best Parameters:", search.best_params_)
    print("Best Score (CV avg):", search.best_score_)
    
    attributes = {
        "C": "C",
        # Logistic regression
        "penalty": "Penalty",
        # SVM
        "intercept_": "Intercept",
        "max_iter": "Max Iterations",
        "n_iter_": "Number of iterations for convergence",
        # Random forest
        "n_estimators": "Number of decision trees",
        "max_depth": "Maximum tree depth",
        "min_samples_split": "Minimum samples to split",
        "min_samples_leaf": "Minimum samples per leaf",
        "max_features": "Maximum features considered at each split",
        "bootstrap": "Bootstrap",
        "feature_importances_": "Feature importances",
    }
    
    for attribute, name in attributes.items():
        if hasattr(best_model, attribute):
            print(f"{name}:", getattr(best_model, attribute))

Plotting functions¶

Plot the learning curve to avoid excessive computations.

In [ ]:
def plot_learning_curve(
    search: GridSearchCV | RandomizedSearchCV,
    param_names: str | list[str],
    plot_title: str = "Learning Curve",
    scoring_label: str | None = None,
    log_scale_params: list[str] | None = None
):
    if not isinstance(param_names, list):
        param_names = [param_names]
    if log_scale_params is None:
        log_scale_params = []
    
    results = search.cv_results_
    n_params = len(param_names)
    
    # Adjust figure size depending on number of subplots
    fig, axes = plt.subplots(n_params, 1, figsize = (10, 4 * n_params), squeeze = False)
    
    if scoring_label is None:
        scoring_label = search.scoring if isinstance(search.scoring, str) else "score"
        
    for i, param in enumerate(param_names):
        raw_values = [params[param] for params in results["params"]]

        # Detect type: numeric or not
        if all(isinstance(val, (int, float)) for val in raw_values):
            param_range = np.array(raw_values)
            unique_param_range = np.unique(param_range)
            is_numeric = True
        else:
            param_range = [str(val) for val in raw_values]
            unique_param_range = sorted(set(param_range))
            is_numeric = False

        train_scores = []
        val_scores = []
        std_scores = []

        for value in unique_param_range:
            if is_numeric:
                mask = param_range == value
            else:
                mask = [v == value for v in param_range]

            train_scores.append(np.mean(np.array(results["mean_train_score"])[mask]))
            val_scores.append(np.mean(np.array(results["mean_test_score"])[mask]))
            std_scores.append(np.mean(np.array(results["std_test_score"])[mask]))

        axis = axes[i, 0]
        x_values = unique_param_range if is_numeric else range(len(unique_param_range))

        # Plot training scores on left y-axis
        axis.plot(x_values, train_scores, label = "Training score", marker = "o", color = "tab:blue")
        axis.set_ylabel(f"Train {scoring_label}", color = "tab:blue")
        axis.tick_params(axis = "y", labelcolor = "tab:blue")

        # Plot validation scores on right y-axis
        axis2 = axis.twinx()
        axis2.plot(x_values, val_scores, label = "Validation score", marker = "s", color = "tab:orange")
        axis2.fill_between(
            x_values,
            np.array(val_scores) - np.array(std_scores),
            np.array(val_scores) + np.array(std_scores),
            alpha = 0.2,
            color = "tab:orange"
        )
        axis2.set_ylabel(f"Validation {scoring_label}", color = "tab:orange")
        axis2.tick_params(axis = "y", labelcolor = "tab:orange")

        axis.set_title(f"{plot_title} ({param})")
        axis.set_xlabel(param)
        axis.grid(True)

        if param in log_scale_params:
            axis.set_xscale("log")
            axis2.set_xscale("log")

        if not is_numeric:
            axis.set_xticks(x_values)
            axis.set_xticklabels(unique_param_range, rotation = 45)

    plt.tight_layout()
    plt.show()

Test function¶

In [ ]:
def test_model(model, X_test, y_test, verbose: bool = True):
    if verbose:
        print("========================= Testing =========================")
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    
    if verbose:
        cm = confusion_matrix(y_test, predictions)
        labels = ['Hypo', 'Norm']
        cm_df = pd.DataFrame(cm, index=[f"Actual {l}" for l in labels], columns = [f"Predicted {l}" for l in labels])
        print("Confusion matrix:")
        print(cm_df)
        print("Accuracy:", accuracy)
        print("Classification report:\n", classification_report(y_test, predictions))
    
    return accuracy

Custom classifier class¶

This classifier wrapper class allows the specific train-test split to be stored alongside the model to allow ensembling without reusing train data.

In [ ]:
class TrainedModelWrapper:
    def __init__(
        self,
        model: BaseEstimator,
        X,
        y,
        X_train,
        y_train,
        X_test,
        y_test,
        accuracy: float
    ):
        self.model = model
        self.X = X
        self.y = y
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        self.accuracy = accuracy
    
    def predict(self, X):
        return self.model.predict(X)
    
    def score(self, X, y):
        return self.model.score(X, y)

    def summary(self, verbose = True):
        if verbose:
            print("Model:", type(self.model).__name__)
            print("Training set size:", self.X_train.shape)
            print("Test set size:", self.X_test.shape)
            print("Accuracy on test set:", self.accuracy)

Logistic Regression¶

Logistic regression provides a simple and interpretable linear model which is well-suited for binary classification. The model's coefficients provide insight into the importance of the features.

Key hyperparameters¶

  • penalty: L2 regularization is chosen to prevent overfitting and prioritize accuracy and stability.
  • C: regularization strength that controls trade-off between fitting the data well and regularizing the coefficients.
  • solver: For larger data sets, the SAG (stochastic average gradient) solver is used to approximate gradient descent in a way that scales better with large data.

Training¶

Grid/randomized search cross-validation is an important step of training to identify the optimal hyperparameters.

In [ ]:
def train_logistic_regression(
    X_train,
    y_train,
    random_state: int | None = None,
    n_jobs: int | None = None,
    verbose: bool = True
):
    if verbose:
        print("========================= Training =========================")
    
    n_samples = X_train.shape[0]
    
    params = {
        "penalty": ["l2"],
        "C": [0.01, 0.1, 1, 10],
    } if n_samples < 10_000 else {
        "penalty": ["l2"],
        "C": [0.01, 0.1, 1, 10],
        "solver": ["sag"] # More efficient on larger data sets
    }
    
    model = GridSearchCV(
        estimator = LogisticRegression(max_iter = 20_000, random_state = random_state, n_jobs = n_jobs),
        param_grid = params,
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    ) if n_samples < 10_000 else RandomizedSearchCV(
        estimator = LogisticRegression(max_iter = 20_000, random_state = random_state, n_jobs = n_jobs),
        param_distributions = params,
        random_state = random_state,
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    )
    
    model.fit(X_train, y_train)
    
    if verbose:
        summarize_crossvalidation(model)
        print("Training accuracy:", model.score(X_train, y_train))
    
    return model.best_estimator_

Evaluation¶

The train-test split is stratified to ensure the split is representative of the labels.

In [ ]:
def train_test_logistic_regression(
    X,
    y,
    test_size: float = 0.25,
    train_size: float | None = None,
    random_state: int = 10,
    n_jobs: int | None = None,
    verbose: bool = True
):
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y
    )
    
    if verbose:
        print("Training data dimensions:", X_train.shape)
        print("Testing data dimensions:", X_test.shape)
    
    # Train the model
    model = train_logistic_regression(X_train = X_train, y_train = y_train, random_state = random_state, n_jobs = n_jobs, verbose = verbose)
    
    # Evaluate the model
    accuracy = test_model(model = model, X_test = X_test, y_test = y_test, verbose = verbose)
    
    return TrainedModelWrapper(
        model = model,
        X = X,
        y = y,
        X_train = X_train,
        y_train = y_train,
        X_test = X_test,
        y_test = y_test,
        accuracy = accuracy
    )
In [ ]:
ss_mcf7_pca_logit = train_test_logistic_regression(X_pca_ss_mcf7, y_pca_ss_mcf7, n_jobs = -1)
Training data dimensions: (187, 20)
Testing data dimensions: (63, 20)
========================= Training =========================
Best Parameters: {'C': 0.01, 'penalty': 'l2'}
Best Score (CV avg): 0.9891891891891891
C: 0.01
Penalty: l2
Intercept: [-8.10718121e-07]
Max Iterations: 10000
Number of iterations for convergence: [35]
Training accuracy: 1.0
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              31               0
Actual Norm               0              32
Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      1.00      1.00        31
        Norm       1.00      1.00      1.00        32

    accuracy                           1.00        63
   macro avg       1.00      1.00      1.00        63
weighted avg       1.00      1.00      1.00        63

In [ ]:
ss_hcc_pca_logit = train_test_logistic_regression(X_pca_ss_hcc, y_pca_ss_hcc, n_jobs = -1)
Training data dimensions: (136, 34)
Testing data dimensions: (46, 34)
========================= Training =========================
Best Parameters: {'C': 0.01, 'penalty': 'l2'}
Best Score (CV avg): 0.978042328042328
C: 0.01
Penalty: l2
Intercept: [-1.46755599e-06]
Max Iterations: 10000
Number of iterations for convergence: [46]
Training accuracy: 1.0
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              24               1
Actual Norm               0              21
Accuracy: 0.9782608695652174
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      0.96      0.98        25
        Norm       0.95      1.00      0.98        21

    accuracy                           0.98        46
   macro avg       0.98      0.98      0.98        46
weighted avg       0.98      0.98      0.98        46

In [ ]:
ds_mcf7_pca_logit = train_test_logistic_regression(X_pca_ds_mcf7, y_pca_ds_mcf7, n_jobs = -1)
Training data dimensions: (16219, 761)
Testing data dimensions: (5407, 761)
========================= Training =========================
Best Parameters: {'solver': 'sag', 'penalty': 'l2', 'C': 0.1}
Best Score (CV avg): 0.9792835217881786
C: 0.1
Penalty: l2
Intercept: [1.]
Max Iterations: 20000
Number of iterations for convergence: [20000]
Training accuracy: 0.9901966829027684
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2166              64
Actual Norm              46            3131
Accuracy: 0.9796560014795636
Classification report:
               precision    recall  f1-score   support

        Hypo       0.98      0.97      0.98      2230
        Norm       0.98      0.99      0.98      3177

    accuracy                           0.98      5407
   macro avg       0.98      0.98      0.98      5407
weighted avg       0.98      0.98      0.98      5407

In [ ]:
ds_hcc_pca_logit = train_test_logistic_regression(X_pca_ds_hcc, y_pca_ds_hcc, n_jobs = -1)
Training data dimensions: (11011, 844)
Testing data dimensions: (3671, 844)
========================= Training =========================
Best Parameters: {'solver': 'sag', 'penalty': 'l2', 'C': 0.1}
Best Score (CV avg): 0.9554083833332715
C: 0.1
Penalty: l2
Intercept: [-2.1879232]
Max Iterations: 20000
Number of iterations for convergence: [132]
Training accuracy: 0.9741167922986105
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2116             109
Actual Norm              80            1366
Accuracy: 0.9485153909016617
Classification report:
               precision    recall  f1-score   support

        Hypo       0.96      0.95      0.96      2225
        Norm       0.93      0.94      0.94      1446

    accuracy                           0.95      3671
   macro avg       0.94      0.95      0.95      3671
weighted avg       0.95      0.95      0.95      3671

Support Vector Machine¶

Support vector machine is an effective classifier in high-dimensional data and can use the kernel trick for non-linear boundaries resulting from complex relationships in the data (however, LinearSVC is found to be the best classifier).

Key hyperparameters¶

  • penalty: L2 regularization is used by default in LinearSVC.
  • C: Controls the regularization strength.

Training¶

In [ ]:
def train_svm(
    X_train,
    y_train,
    random_state: int | None = None,
    n_jobs: int | None = None,
    verbose: bool = True
):
    if verbose:
        print("========================= Training =========================")
        
    params = {
        "C": [0.025, 0.05, 0.1, 1, 10, 50]
    }
    
    model = GridSearchCV(
        estimator = LinearSVC(random_state = random_state, max_iter = 10_000),
        param_grid = params,
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    )
    
    model.fit(X_train, y_train)
    
    if verbose:
        summarize_crossvalidation(model)
        print("Training accuracy:", model.score(X_train, y_train))
        
        plot_learning_curve(model, list(params.keys()))
    
    return model.best_estimator_

Evaluation¶

In [ ]:
def train_test_svm(
    X,
    y,
    test_size: float = 0.25,
    train_size: float | None = None,
    random_state: int = 10,
    n_jobs: int | None = None,
    verbose: bool = True
):
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y
    )
    
    if verbose:
        print("Training data dimensions:", X_train.shape)
        print("Testing data dimensions:", X_test.shape)
    
    # Train the model
    model = train_svm(X_train = X_train, y_train = y_train, random_state = random_state, n_jobs = n_jobs, verbose = verbose)
    
    # Evaluate the model
    accuracy = test_model(model = model, X_test = X_test, y_test = y_test, verbose = verbose)
    
    return TrainedModelWrapper(
        model = model,
        X = X,
        y = y,
        X_train = X_train,
        y_train = y_train,
        X_test = X_test,
        y_test = y_test,
        accuracy = accuracy
    )
In [ ]:
ss_mcf7_pca_svm = train_test_svm(X_pca_ss_mcf7, y_pca_ss_mcf7, n_jobs = -1)
Training data dimensions: (187, 20)
Testing data dimensions: (63, 20)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.9891891891891891
C: 0.025
Penalty: l2
Intercept: [-1.68452782e-08]
Max Iterations: 10000
Number of iterations for convergence: 8
Training accuracy: 1.0
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              31               0
Actual Norm               0              32
Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      1.00      1.00        31
        Norm       1.00      1.00      1.00        32

    accuracy                           1.00        63
   macro avg       1.00      1.00      1.00        63
weighted avg       1.00      1.00      1.00        63

In [ ]:
ss_hcc_pca_svm = train_test_svm(X_pca_ss_hcc, y_pca_ss_hcc, n_jobs = -1)
Training data dimensions: (136, 34)
Testing data dimensions: (46, 34)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.9634920634920634
C: 0.025
Penalty: l2
Intercept: [-1.02120292e-07]
Max Iterations: 10000
Number of iterations for convergence: 9
Training accuracy: 1.0
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              24               1
Actual Norm               0              21
Accuracy: 0.9782608695652174
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      0.96      0.98        25
        Norm       0.95      1.00      0.98        21

    accuracy                           0.98        46
   macro avg       0.98      0.98      0.98        46
weighted avg       0.98      0.98      0.98        46

In [ ]:
ds_mcf7_pca_svm = train_test_svm(X_pca_ds_mcf7, y_pca_ds_mcf7, n_jobs = -1)
Training data dimensions: (16219, 761)
Testing data dimensions: (5407, 761)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.9773103826395694
C: 0.025
Penalty: l2
Intercept: [0.32791924]
Max Iterations: 10000
Number of iterations for convergence: 8
Training accuracy: 0.9903199950675134
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2163              67
Actual Norm              51            3126
Accuracy: 0.9781764379508046
Classification report:
               precision    recall  f1-score   support

        Hypo       0.98      0.97      0.97      2230
        Norm       0.98      0.98      0.98      3177

    accuracy                           0.98      5407
   macro avg       0.98      0.98      0.98      5407
weighted avg       0.98      0.98      0.98      5407

In [ ]:
ds_hcc_pca_svm = train_test_svm(X_pca_ds_hcc, y_pca_ds_hcc, n_jobs = -1)
Training data dimensions: (11011, 844)
Testing data dimensions: (3671, 844)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.9528651995070714
C: 0.025
Penalty: l2
Intercept: [-0.82237322]
Max Iterations: 10000
Number of iterations for convergence: 8
Training accuracy: 0.9789301607483426
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2112             113
Actual Norm              82            1364
Accuracy: 0.9468809588667938
Classification report:
               precision    recall  f1-score   support

        Hypo       0.96      0.95      0.96      2225
        Norm       0.92      0.94      0.93      1446

    accuracy                           0.95      3671
   macro avg       0.94      0.95      0.94      3671
weighted avg       0.95      0.95      0.95      3671

Random Forest¶

Random forest leverages the power of ensembling to provide accurate predictions while also being able to handle non-linearity and interactions between features. The feature importance scores help with interpretability of the model and the features which contribute to hypoxia classification. Bootstrap aggregation and random feature selection make this model very robust.

Key hyperparameters¶

  • n_estimators: The number of decision trees in the forest. More trees can improve performance, given a sufficiently large data set, but increase training time.
  • max_depth: The maximum depth of each tree limits model complexity to prevent overfitting.
  • min_samples_split: The minimum number of samples required to split an internal node.
    • Higher values reduce overfitting.
  • min_samples_leaf: The minimum number of samples required to be in a leaf node.
    • Helps smooth the model and prevent learning from outliers.
  • max_features: The number of features to consider when looking for the best split.
    • Controls tree diversity and model variance.
  • bootstrap: Whether bootstrap samples are used when building trees.
    • Introduces randomness for better generalization.

Training¶

There are different parameter options depending on the size of the training set to avoid overfitting while also minimizing computational complexity. Grid search is used for cross-validation on smaller data sets and randomized search on larger sets as the computation time increases significantly with the size of the data set. Plots of the learning curves for each hyperparameter are helpful for narrowing down the pool of hyperparameters to then run more comprehensive searches.

As the data set gets larger, the model becomes more robust to the individual decision trees, so certain hyperparameter pools like the number of trees can be relaxed.

Initially, the confusion matrix for the model (specifically on DropSeq HCC) showed the model predicting a significant portion of normoxia as hypoxia, with a 15% difference between the training and testing scores, suggesting overfitting. The scorer for GridSearchCV and RandomizedSearchCV was changed to f1_macro to better accommodate the uneven distribution of labels in the data.

In [ ]:
def train_random_forest(
    X_train,
    y_train,
    random_state: int | None = None,
    n_jobs: int | None = None,
    verbose: bool = True
):  
    if verbose:
        print("========================= Training =========================")
    n_samples = X_train.shape[0]
    
    params = {
        "n_estimators": [25, 50, 100],
        "max_depth": [5, 10, 20, None],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4, 10],
        "max_features": ["sqrt", "log2", None],
        "bootstrap": [True, False]
    } if n_samples < 1_000 else {
        "n_estimators": [100, 200, 300, 400],
        "class_weight": ["balanced"],
        "max_depth": [5, 10, 20],
        "min_samples_split": [5, 10, 20],
        "min_samples_leaf": [1, 2, 5, 10, 25],
        "max_features": ["sqrt"],
        "bootstrap": [True]
    } if n_samples < 15_000 else {
        "n_estimators": [100, 200, 400, 600],
        "class_weight": ["balanced"],
        "max_depth": [10, 20, 30],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "max_features": ["sqrt"],
        "bootstrap": [True]
    }
    
    model = GridSearchCV(
        estimator = RandomForestClassifier(random_state = random_state, n_jobs = n_jobs),
        param_grid = params,
        scoring = "f1_macro",
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    ) if n_samples < 10_000 else RandomizedSearchCV(
        estimator = RandomForestClassifier(random_state = random_state, n_jobs = n_jobs),
        param_distributions = params,
        random_state = random_state,
        scoring = "f1_macro",
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    )
    
    model.fit(X_train, y_train)
    
    if verbose:
        summarize_crossvalidation(model)
        print("Training accuracy:", model.score(X_train, y_train))
        
        plot_learning_curve(model, list(params.keys()))
    
    return model.best_estimator_

Evaluation¶

In [ ]:
def train_test_random_forest(
    X,
    y,
    test_size: float = 0.25,
    train_size: float | None = None,
    random_state: int = 10,
    n_jobs: int | None = None,
    verbose: bool = True
):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y
    )
    
    if verbose:
        print("Training data dimensions:", X_train.shape)
        print("Testing data dimensions:", X_test.shape)
    
    model = train_random_forest(X_train = X_train, y_train = y_train, random_state = random_state, n_jobs =  n_jobs, verbose = verbose)
    accuracy = test_model(model = model, X_test = X_test, y_test = y_test, verbose = verbose)
    
    return TrainedModelWrapper(
        model = model,
        X = X,
        y = y,
        X_train = X_train,
        y_train = y_train,
        X_test = X_test,
        y_test = y_test,
        accuracy = accuracy
    )
In [ ]:
ss_mcf7_pca_random_forest = train_test_random_forest(X_pca_ss_mcf7, y_pca_ss_mcf7, n_jobs = -1)
Training data dimensions: (187, 20)
Testing data dimensions: (63, 20)
========================= Training =========================
Best Parameters: {'bootstrap': True, 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 25}
Best Score (CV avg): 0.9945945945945945
Number of decision trees: 25
Maximum tree depth: 5
Minimum samples to split: 10
Minimum samples per leaf: 4
Maximum features considered at each split: sqrt
Bootstrap: True
Feature importances: [4.87643283e-01 7.33065976e-02 6.00266784e-02 9.91685382e-02
 8.78119702e-02 8.30256017e-02 4.92950249e-03 6.29815380e-04
 5.22548266e-02 1.62137250e-02 8.48520910e-04 5.76192650e-04
 7.17375907e-03 1.37367232e-02 6.91512970e-04 1.09635964e-02
 2.10741782e-04 5.07314934e-04 2.81099111e-04 0.00000000e+00]
Training accuracy: 1.0
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              31               0
Actual Norm               0              32
Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      1.00      1.00        31
        Norm       1.00      1.00      1.00        32

    accuracy                           1.00        63
   macro avg       1.00      1.00      1.00        63
weighted avg       1.00      1.00      1.00        63

In [ ]:
ss_hcc_pca_random_forest = train_test_random_forest(X_pca_ss_hcc, y_pca_ss_hcc, n_jobs = -1)
Training data dimensions: (136, 34)
Testing data dimensions: (46, 34)
========================= Training =========================
Best Parameters: {'bootstrap': False, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 50}
Best Score (CV avg): 0.9852743878956579
Number of decision trees: 50
Maximum tree depth: 10
Minimum samples to split: 2
Minimum samples per leaf: 10
Maximum features considered at each split: sqrt
Bootstrap: False
Feature importances: [0.0064334  0.26821343 0.39898388 0.04399438 0.0053163  0.01452682
 0.01100844 0.01197652 0.01214636 0.02534943 0.0084228  0.02346837
 0.00617053 0.01136463 0.01097255 0.00409977 0.00640453 0.00222945
 0.01518255 0.00390206 0.00612587 0.00965581 0.00137775 0.00401086
 0.00990919 0.01525441 0.00095232 0.00268207 0.01150917 0.00975953
 0.00677455 0.00704284 0.01695411 0.0078253 ]
Training accuracy: 0.9926147162639153
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              23               2
Actual Norm               1              20
Accuracy: 0.9347826086956522
Classification report:
               precision    recall  f1-score   support

        Hypo       0.96      0.92      0.94        25
        Norm       0.91      0.95      0.93        21

    accuracy                           0.93        46
   macro avg       0.93      0.94      0.93        46
weighted avg       0.94      0.93      0.93        46

In [ ]:
ds_mcf7_pca_random_forest = train_test_random_forest(X_pca_ds_mcf7, y_pca_ds_mcf7, n_jobs = -1)
Training data dimensions: (16219, 761)
Testing data dimensions: (5407, 761)
========================= Training =========================
Best Parameters: {'n_estimators': 600, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'class_weight': 'balanced', 'bootstrap': True}
Best Score (CV avg): 0.9246355565116522
Number of decision trees: 600
Maximum tree depth: 10
Minimum samples to split: 5
Minimum samples per leaf: 2
Maximum features considered at each split: sqrt
Bootstrap: True
Feature importances: [0.10531168 0.1298405  0.14376609 0.00890776 0.02425938 0.02095302
 0.002667   0.00859892 0.00154817 0.00104473 0.00099024 0.00243691
 0.00079045 0.00100409 0.00238867 0.00420052 0.00227854 0.00758007
 0.00386235 0.00065713 0.00139259 0.00101725 0.00311133 0.00083539
 0.00550909 0.00690945 0.00146923 0.00197005 0.00126192 0.00069935
 0.00125397 0.00175273 0.00311906 0.00106822 0.00351382 0.01149722
 0.01603425 0.00457864 0.000856   0.00063889 0.00058418 0.00055282
 0.00046926 0.0009799  0.00039346 0.00069561 0.00615267 0.00423996
 0.00028781 0.00071115 0.00033103 0.00030638 0.00031317 0.00055128
 0.00434507 0.00109667 0.00033435 0.00051367 0.00043008 0.00183359
 0.00028726 0.00035108 0.0004725  0.00035709 0.00052879 0.00059977
 0.00036044 0.00033627 0.00049262 0.00032868 0.0002728  0.00037134
 0.00028611 0.0002421  0.00067918 0.00034513 0.00031624 0.00036998
 0.00036177 0.0004618  0.00050901 0.00520248 0.00038603 0.00054757
 0.00092149 0.00077227 0.00048457 0.00033451 0.00029814 0.00049587
 0.00048314 0.00044835 0.00026905 0.00043941 0.0035963  0.00043558
 0.00365572 0.00072697 0.000371   0.00047457 0.00056056 0.00030292
 0.00054543 0.0004982  0.00063516 0.00039058 0.00041866 0.00034867
 0.00034766 0.00258043 0.00035767 0.00023369 0.00102244 0.0003174
 0.00036132 0.00185167 0.00068316 0.00114183 0.00100102 0.00026577
 0.00137913 0.00027339 0.0004236  0.00043554 0.00026215 0.00044223
 0.00055951 0.00046921 0.00024452 0.00068093 0.00025074 0.00041923
 0.00326383 0.00034942 0.00096892 0.00032075 0.00023898 0.00031992
 0.00033645 0.00150273 0.00028185 0.00498445 0.00044432 0.00198335
 0.00487231 0.00037248 0.00274954 0.00031893 0.00445734 0.00029965
 0.00489857 0.00033456 0.00027911 0.00021839 0.0004373  0.00026372
 0.00479623 0.00075912 0.00048689 0.00025701 0.00052364 0.00030578
 0.00061907 0.00032457 0.00040429 0.00057435 0.00302461 0.00041054
 0.00026883 0.00149999 0.00043445 0.00046755 0.00031501 0.00047634
 0.00042197 0.0006877  0.00033929 0.00041833 0.00048601 0.00044755
 0.00034759 0.00027137 0.00045445 0.00029416 0.00028925 0.00038233
 0.00046582 0.00024644 0.00030989 0.00027655 0.00123483 0.000325
 0.00037917 0.00031292 0.00024639 0.00034892 0.0002873  0.00034006
 0.00021666 0.00032756 0.0006794  0.00028754 0.00060304 0.00037378
 0.00028748 0.00043884 0.00033982 0.00031651 0.00027366 0.0002707
 0.00025703 0.00024373 0.00031308 0.00026148 0.00028144 0.00029017
 0.00039625 0.00024265 0.00029547 0.00032254 0.00031326 0.00040388
 0.00025408 0.00040568 0.00024113 0.0003202  0.00037481 0.00039226
 0.00027595 0.00036615 0.00035766 0.00145637 0.00032405 0.00037314
 0.00077107 0.00147386 0.00029295 0.00075031 0.00070221 0.00040906
 0.00073319 0.00092325 0.00036494 0.00039274 0.00025713 0.00043654
 0.00095813 0.00038295 0.00042241 0.00026091 0.0003436  0.00052236
 0.00037572 0.00038373 0.00030407 0.00029102 0.00030134 0.00042555
 0.00032727 0.00048261 0.0004383  0.000307   0.00022795 0.00083452
 0.00031865 0.00029358 0.0007342  0.00027078 0.0002773  0.00043468
 0.00123634 0.00019869 0.00035141 0.00100078 0.0006563  0.00043935
 0.00055915 0.00398463 0.00078137 0.000493   0.00060457 0.00047938
 0.0003374  0.00050419 0.00044541 0.0005015  0.00051325 0.00047102
 0.00031658 0.00083919 0.00044009 0.00041698 0.00031938 0.00127632
 0.00079488 0.00043122 0.00131825 0.00040851 0.00048043 0.00139255
 0.0019101  0.00027901 0.00165899 0.00041956 0.00047509 0.00078968
 0.00197831 0.00051353 0.00035162 0.00062458 0.00083784 0.00059051
 0.00071212 0.0035567  0.00031613 0.00063115 0.00450602 0.00317074
 0.00054999 0.00177576 0.00451337 0.0027391  0.0005018  0.00072116
 0.00030761 0.00345046 0.0011965  0.00227663 0.00029977 0.00043484
 0.0013455  0.00086135 0.00040533 0.00088912 0.00684443 0.00053045
 0.00072352 0.00209918 0.00051302 0.00042435 0.00038161 0.00067255
 0.00041105 0.00256786 0.00094846 0.0006198  0.00044308 0.00094528
 0.00051006 0.00064724 0.00043295 0.00118618 0.00051426 0.00029167
 0.00039577 0.00067263 0.00071725 0.00150253 0.000357   0.00024314
 0.00058133 0.0003918  0.00124965 0.00130424 0.00034879 0.00068637
 0.00142809 0.00066068 0.00046248 0.00057926 0.00100908 0.00081208
 0.00062834 0.00099082 0.00125176 0.00116258 0.00157917 0.00120138
 0.00178949 0.00209933 0.00294813 0.00061149 0.00185401 0.00039849
 0.00055386 0.00037537 0.00079814 0.00078185 0.00037376 0.00044869
 0.00079072 0.00038771 0.00065747 0.00128027 0.00100872 0.00040445
 0.00038032 0.00039798 0.00038978 0.00078005 0.00102374 0.00081737
 0.00239271 0.00070264 0.00073869 0.00034526 0.00056761 0.0005756
 0.00048465 0.00056605 0.000443   0.00084507 0.00053156 0.00053208
 0.00061853 0.00083416 0.0002911  0.00053235 0.00062375 0.00049954
 0.00112582 0.00250048 0.00044941 0.00141374 0.00048433 0.00113186
 0.00052561 0.00041922 0.00034755 0.00056605 0.00043394 0.00037361
 0.00045622 0.00067381 0.0004423  0.00053031 0.00041956 0.00107256
 0.00079628 0.00037066 0.00049364 0.00169165 0.00043079 0.00081818
 0.0003506  0.00050345 0.00111144 0.00041988 0.00044528 0.00103099
 0.00081886 0.00075629 0.00040685 0.00030028 0.000571   0.00036758
 0.00034553 0.00154734 0.00058072 0.00036249 0.00051047 0.00048956
 0.00070622 0.00099259 0.00049449 0.00045869 0.00029623 0.0005214
 0.00035259 0.00033357 0.00033984 0.00031595 0.00043209 0.00133602
 0.00133612 0.00037854 0.00083018 0.00083259 0.00038599 0.00076312
 0.00046143 0.00033456 0.00107701 0.00028666 0.00041183 0.00066119
 0.00054765 0.00071805 0.00049177 0.00063936 0.00030198 0.00037442
 0.00038605 0.00063903 0.000651   0.00083435 0.00032801 0.00097488
 0.00054724 0.00048431 0.00039327 0.00053298 0.00056161 0.00033175
 0.00027273 0.00053975 0.00043248 0.00055294 0.00027547 0.00031195
 0.00041462 0.00036052 0.00030161 0.00035572 0.00041122 0.0004305
 0.0004068  0.00040385 0.00038672 0.00069569 0.00027762 0.00049147
 0.00036931 0.00028512 0.00028854 0.00045523 0.0003621  0.00034837
 0.00023828 0.00022252 0.00033068 0.00043415 0.00034472 0.00033524
 0.00025828 0.000327   0.00037402 0.00029807 0.00029158 0.0003553
 0.00041216 0.00043521 0.00036297 0.00040371 0.0003658  0.00034593
 0.00034808 0.00061949 0.00031006 0.0003575  0.00042709 0.00039049
 0.00033915 0.00067957 0.00045901 0.00026427 0.00031127 0.00035066
 0.00031878 0.00032873 0.00026781 0.00036249 0.00043142 0.00085352
 0.00044547 0.0004821  0.00025455 0.00033042 0.00054598 0.00039337
 0.00032047 0.00034438 0.00035296 0.00026788 0.00037184 0.00030585
 0.00023866 0.00021862 0.00043793 0.00033364 0.00031334 0.00041238
 0.00030189 0.00070283 0.00031154 0.00035906 0.0003818  0.00037603
 0.00029103 0.00040264 0.00039551 0.00030778 0.00034299 0.00056394
 0.0003813  0.0003787  0.00027861 0.00032795 0.00030901 0.00033301
 0.00044004 0.00037484 0.00027191 0.00034013 0.00046865 0.00034942
 0.00082124 0.0003635  0.00038131 0.00056936 0.00038337 0.00042241
 0.00023855 0.0003306  0.00028391 0.00031094 0.00033818 0.00026622
 0.00022612 0.00031748 0.00035502 0.00034172 0.00038318 0.00038781
 0.00034458 0.00035536 0.00031096 0.00034362 0.00028379 0.00029752
 0.00031086 0.00042886 0.00027878 0.00050551 0.00032571 0.00046166
 0.0004208  0.00035131 0.00027304 0.00034648 0.00026529 0.0005698
 0.0003201  0.00028278 0.00030133 0.00034648 0.00030764 0.00026646
 0.00034035 0.00053556 0.00037732 0.00034907 0.00028583 0.00042038
 0.00025485 0.00025552 0.00031885 0.00031083 0.00032842 0.00027913
 0.00040893 0.00026345 0.00038301 0.00028446 0.00034416 0.00037919
 0.00056237 0.00037488 0.00036957 0.00046009 0.00037346 0.00046441
 0.0004007  0.00038125 0.00032505 0.00044276 0.00031539 0.00036569
 0.00033101 0.00030341 0.00075396 0.00035388 0.00037215 0.00034485
 0.00031414 0.00045129 0.00026111 0.00040362 0.00036975 0.00032967
 0.00031811 0.00041005 0.00026292 0.00026916 0.00039454 0.00028067
 0.00032729 0.00078951 0.0003763  0.00040922 0.00040862 0.00024828
 0.00028074 0.00033801 0.00028914 0.00039496 0.00033345 0.00035551
 0.00040217 0.00037222 0.00040346 0.00042141 0.00040244 0.00030947
 0.00047869 0.00052788 0.00060206 0.00036037 0.0004646  0.00046887
 0.00028266 0.00041449 0.000472   0.00032433 0.00038677 0.00039108
 0.00038871 0.00040167 0.00036875 0.00034833 0.00054566 0.00048124
 0.00038589 0.0003157  0.00035669 0.00034465 0.00032817 0.00035557
 0.00037518 0.00058563 0.00055149 0.00043627 0.00078367 0.00047058
 0.00040339 0.00036191 0.00030057 0.00049882 0.00030993 0.00052802
 0.0004069  0.00046966 0.00052695 0.00048346 0.00050339 0.00037464
 0.00032322 0.00060991 0.00059623 0.00044005 0.00047888]
Training accuracy: 0.9751030814126183
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2011             219
Actual Norm             136            3041
Accuracy: 0.9343443684113186
Classification report:
               precision    recall  f1-score   support

        Hypo       0.94      0.90      0.92      2230
        Norm       0.93      0.96      0.94      3177

    accuracy                           0.93      5407
   macro avg       0.93      0.93      0.93      5407
weighted avg       0.93      0.93      0.93      5407

In [ ]:
ds_hcc_pca_random_forest = train_test_random_forest(X_pca_ds_hcc, y_pca_ds_hcc, n_jobs = -1)
Training data dimensions: (11011, 844)
Testing data dimensions: (3671, 844)
========================= Training =========================
Best Parameters: {'n_estimators': 400, 'min_samples_split': 5, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'max_depth': 10, 'class_weight': 'balanced', 'bootstrap': True}
Best Score (CV avg): 0.8894451495995834
Number of decision trees: 400
Maximum tree depth: 10
Minimum samples to split: 5
Minimum samples per leaf: 5
Maximum features considered at each split: sqrt
Bootstrap: True
Feature importances: [0.01556211 0.01332938 0.02437381 0.00887438 0.05431453 0.15896834
 0.0013507  0.00367518 0.00127416 0.00183298 0.00448067 0.01776261
 0.00184872 0.00125025 0.00918616 0.00086553 0.00197497 0.01854113
 0.00168007 0.01304374 0.00213133 0.00061732 0.00244949 0.0016368
 0.00054001 0.00173133 0.00116409 0.00063809 0.00097648 0.00168106
 0.00133609 0.0007371  0.00084775 0.00210989 0.00224028 0.00602436
 0.00873917 0.00202338 0.01109634 0.00044959 0.00130168 0.00196231
 0.00074593 0.00373249 0.00143229 0.00463699 0.00272285 0.00119899
 0.00535349 0.00074939 0.00075856 0.00078748 0.00214877 0.00163909
 0.0007553  0.0012141  0.00060405 0.00046447 0.00061247 0.00294571
 0.00077433 0.00064645 0.00122328 0.00065023 0.0075631  0.00077905
 0.00080332 0.00223036 0.00135002 0.00212975 0.00111877 0.00119236
 0.0017247  0.00461049 0.00386357 0.00376939 0.00104736 0.00498201
 0.0010248  0.00330714 0.00079378 0.00048989 0.00059584 0.00049261
 0.00068026 0.00057738 0.00080852 0.00058372 0.00039259 0.00296434
 0.00058196 0.00059213 0.00051131 0.0008211  0.00067767 0.00061226
 0.00051419 0.00087139 0.00716098 0.00244524 0.00076278 0.00281701
 0.01749129 0.00084529 0.00055982 0.00222925 0.00048428 0.00041596
 0.0016672  0.00311293 0.00113994 0.0005407  0.0007626  0.00047769
 0.00052031 0.00050358 0.00053626 0.0010914  0.00065376 0.00053428
 0.00093926 0.00067285 0.00086375 0.00102802 0.00060578 0.00173141
 0.00684521 0.00070499 0.00050233 0.00054504 0.00052275 0.00045634
 0.00134417 0.00459004 0.00060823 0.00058205 0.001067   0.00062061
 0.00054693 0.00113608 0.00690824 0.00381467 0.00098319 0.00079215
 0.00415323 0.00094046 0.00166385 0.00116077 0.00096595 0.0008926
 0.00097181 0.00071386 0.00070076 0.00065067 0.00264815 0.00179797
 0.00061808 0.00046481 0.00061474 0.0005897  0.00205432 0.00043104
 0.00038956 0.00097699 0.00057125 0.00050097 0.00475183 0.00075185
 0.00159505 0.0013878  0.00072402 0.0032366  0.0010579  0.00186023
 0.00085914 0.00365674 0.00060932 0.00110362 0.0009037  0.00060107
 0.00109702 0.00201554 0.00088818 0.00177    0.00230438 0.00060168
 0.00437501 0.00094664 0.00180724 0.00139438 0.00090991 0.00189076
 0.00093883 0.00117496 0.00135589 0.00128304 0.01119274 0.00173013
 0.00108785 0.00066931 0.00387838 0.00148438 0.00069254 0.0006568
 0.00044375 0.00076008 0.00068554 0.00155672 0.00060751 0.00046874
 0.00056063 0.00167375 0.00065509 0.00072051 0.00066415 0.00072419
 0.00082454 0.00054486 0.00045343 0.00069155 0.00049277 0.00046386
 0.00030047 0.0004769  0.00053807 0.00061284 0.00045331 0.00050338
 0.00048455 0.00049045 0.00048583 0.00039619 0.00048068 0.00056404
 0.00045649 0.00049807 0.0004476  0.00066544 0.00078638 0.00131371
 0.00060537 0.00057138 0.0004336  0.0004147  0.00056396 0.00055417
 0.00051672 0.00052329 0.00047175 0.00203486 0.00089382 0.00060031
 0.00059987 0.00060181 0.00114545 0.00055001 0.00072426 0.00042265
 0.00236191 0.00075691 0.00040871 0.00185986 0.00088442 0.00043235
 0.00075746 0.00086747 0.0005696  0.00076044 0.00052714 0.00342677
 0.00067977 0.00054517 0.00103546 0.00082098 0.00051772 0.00063276
 0.00077017 0.00060814 0.00046105 0.00066417 0.00055127 0.00057197
 0.00148304 0.00065776 0.0004933  0.00039011 0.00060747 0.00052778
 0.00039206 0.00384451 0.00039337 0.00102246 0.0007918  0.00057602
 0.00032997 0.00106976 0.0004305  0.00042372 0.00058828 0.00067668
 0.00177687 0.00045052 0.00090498 0.00066602 0.00068188 0.00053053
 0.00064152 0.00031669 0.00054314 0.00035669 0.0008402  0.00044964
 0.00055899 0.00045394 0.00040204 0.00077508 0.00135424 0.00070711
 0.00042899 0.00073185 0.00044124 0.00041545 0.00070486 0.00056407
 0.00052535 0.00117377 0.00065302 0.00043444 0.000759   0.00041401
 0.00045835 0.00057571 0.00051076 0.00071106 0.00045357 0.0006442
 0.00055757 0.00053716 0.00042428 0.00045759 0.00049761 0.00053119
 0.0005204  0.00047922 0.00045042 0.00057014 0.00037927 0.00062553
 0.00047032 0.0004471  0.00050842 0.00040305 0.00059871 0.00041915
 0.00045843 0.00055262 0.00043224 0.00086933 0.00046888 0.00045559
 0.00046531 0.00042935 0.00042314 0.00052053 0.00037858 0.00036995
 0.00055344 0.00046626 0.00049141 0.00051588 0.00055633 0.00053725
 0.00047175 0.00043913 0.00093012 0.00039186 0.00054678 0.00048988
 0.0005382  0.00046831 0.00042162 0.00043037 0.00044103 0.00047019
 0.00050098 0.00054089 0.0004234  0.00043488 0.00046107 0.00060425
 0.00046603 0.00058242 0.00044285 0.00036235 0.00050451 0.00049373
 0.00037504 0.00040649 0.00045363 0.00042852 0.00038095 0.00037889
 0.00054208 0.00047065 0.00055819 0.00082124 0.00043254 0.00035073
 0.00041461 0.00041722 0.0005041  0.00040132 0.0005712  0.00055622
 0.00074771 0.00058211 0.00036429 0.00045066 0.00049865 0.00065056
 0.0005079  0.00049739 0.00062886 0.00047953 0.00037884 0.00044671
 0.00042038 0.00051152 0.00040804 0.00046951 0.00047351 0.00046245
 0.00043523 0.00053594 0.00037684 0.00052568 0.00054374 0.00056837
 0.00056615 0.00078979 0.00032377 0.00040263 0.00047492 0.00049456
 0.00042062 0.00050711 0.00045294 0.00053827 0.00058312 0.00052222
 0.0003461  0.00041605 0.00043773 0.00043503 0.00046172 0.00047388
 0.00036075 0.00045466 0.00043182 0.00046299 0.00053667 0.00048153
 0.0004796  0.00032893 0.00049793 0.00043586 0.00046089 0.00043909
 0.00047251 0.00037879 0.00039508 0.0006204  0.00047516 0.0004039
 0.00054548 0.00039939 0.00040089 0.00039647 0.00046816 0.00064493
 0.00043811 0.00042229 0.00047219 0.00032981 0.00042395 0.00040504
 0.00055575 0.00045233 0.00038079 0.00044759 0.00047355 0.00047244
 0.00038336 0.00042553 0.00033414 0.00043394 0.00063382 0.00044478
 0.00056546 0.00038854 0.00043654 0.00045097 0.00044877 0.0004271
 0.00044131 0.00031456 0.00046639 0.00038678 0.00055124 0.00028441
 0.00052403 0.00038867 0.00061439 0.00059023 0.00051487 0.00032175
 0.00058876 0.00031377 0.00065868 0.00034836 0.00037215 0.0005727
 0.00039767 0.00050772 0.00069996 0.00045412 0.00045022 0.00040029
 0.00042713 0.00061628 0.00036973 0.00043879 0.00036537 0.00030508
 0.00044016 0.00078747 0.00051579 0.00033836 0.00039105 0.00059108
 0.00054752 0.00041032 0.0005169  0.00069376 0.00063046 0.00051236
 0.00037725 0.00068524 0.00050268 0.00061998 0.00047021 0.00048317
 0.00052294 0.00053754 0.0004783  0.0005621  0.00046651 0.00048156
 0.0005337  0.00032022 0.00061965 0.00041965 0.0005098  0.0004917
 0.00049733 0.00055372 0.00057612 0.00052536 0.00037732 0.00059536
 0.00041714 0.00041961 0.0003624  0.00057209 0.00034381 0.00044579
 0.00051586 0.00039663 0.00071595 0.00052519 0.00055738 0.00042017
 0.00041026 0.00055984 0.00035082 0.00046102 0.00039487 0.00042902
 0.00062455 0.00054678 0.00045825 0.00037114 0.00040797 0.00044308
 0.00044061 0.00058309 0.00056002 0.00045243 0.00032297 0.00031919
 0.00033797 0.00038011 0.00037747 0.0003617  0.00057709 0.00039837
 0.00045427 0.00040537 0.00048965 0.00054909 0.00038214 0.00041778
 0.00043289 0.00042835 0.00049557 0.00051244 0.00046318 0.0004056
 0.00041031 0.00038141 0.00047444 0.00043974 0.00037476 0.00036134
 0.00032671 0.00054556 0.00039214 0.00050201 0.00045747 0.00041902
 0.00043253 0.00047241 0.0004119  0.00038622 0.00050233 0.00042848
 0.00043948 0.0004442  0.00039755 0.00048347 0.00048148 0.00031899
 0.00039389 0.00035602 0.00053936 0.00042077 0.00061889 0.000386
 0.00046729 0.00037702 0.00036526 0.00044787 0.00045822 0.00040826
 0.00049037 0.0004458  0.00042419 0.00037053 0.00054917 0.0003602
 0.00048142 0.0004795  0.00046837 0.0004291  0.00040798 0.00045988
 0.00037559 0.00038345 0.00037479 0.00044818 0.00045333 0.00039435
 0.00037874 0.00048985 0.0004038  0.00043736 0.00037463 0.0003931
 0.0004233  0.0005212  0.00043266 0.00038122 0.00035645 0.00036388
 0.00035413 0.00030773 0.00041858 0.00050964 0.00031148 0.0004099
 0.00048401 0.00043114 0.00048213 0.00046617 0.00045688 0.00040277
 0.00036055 0.00039382 0.00046754 0.00040982 0.00049309 0.00038646
 0.00033985 0.00045522 0.00055988 0.00035776 0.00044065 0.00029988
 0.00032753 0.00048493 0.000414   0.0003644  0.00033685 0.00040877
 0.00039934 0.00036964 0.00042756 0.00048326 0.00045033 0.00041154
 0.00041201 0.00053677 0.00041422 0.00036513 0.00047225 0.00041451
 0.00036835 0.00042879 0.00045731 0.00045082 0.00045776 0.0003839
 0.00035677 0.00049245 0.00041646 0.0004102  0.0004068  0.00042715
 0.00029613 0.00033614 0.00048362 0.00040997 0.00044753 0.00035816
 0.00042631 0.00041219 0.00049932 0.00036409 0.00034608 0.00044017
 0.00037566 0.00044554 0.00050801 0.00044757 0.0004577  0.00035813
 0.00037306 0.0005499  0.00042208 0.00040454 0.00053057 0.00028407
 0.00030356 0.00038941 0.00032431 0.00035801 0.00037545 0.00046236
 0.00055698 0.00045635 0.00041683 0.00028438 0.00031264 0.00046115
 0.00046821 0.00041125 0.00044694 0.00033223 0.00047251 0.00047906
 0.00050281 0.00042821 0.00041932 0.00042486 0.00042007 0.00040669
 0.00031386 0.00044459 0.00043993 0.00047744 0.00050077 0.00046564
 0.00046778 0.00044584 0.00035969 0.00043309 0.00044199 0.00040411
 0.00039904 0.00052408 0.00035861 0.00041926 0.00043548 0.0004615
 0.00039658 0.00043296 0.00033442 0.00046143 0.00050813 0.00039816
 0.00044554 0.00030297 0.00039987 0.0003612  0.00045544 0.0003402
 0.00048385 0.00038792 0.00045056 0.00039033 0.00047129 0.00046633
 0.00047721 0.00038866 0.00047778 0.00040715 0.00047695 0.00038039
 0.0004288  0.00041936 0.00030967 0.00035194 0.00039722 0.000382
 0.00032951 0.00040688 0.00034731 0.00047104 0.00053974 0.0003595
 0.00034621 0.0004059  0.00039647 0.00045877]
Training accuracy: 0.9760275073971344
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2045             180
Actual Norm             202            1244
Accuracy: 0.8959411604467448
Classification report:
               precision    recall  f1-score   support

        Hypo       0.91      0.92      0.91      2225
        Norm       0.87      0.86      0.87      1446

    accuracy                           0.90      3671
   macro avg       0.89      0.89      0.89      3671
weighted avg       0.90      0.90      0.90      3671

Multilayer perceptron¶

The Multi-layer Perceptron (MLP) is a feedforward neural network trained via backpropagation. It consists of one or more fully connected hidden layers with non-linear activation functions.

  • Loss: Cross-entropy
  • Hidden layer activation: ReLU
  • Output activation: Sigmoid for binary classification

MLPs may outperform traditional ML models when:

  • There are complex nonlinear relationships that tree models or SVMs cannot easily capture
  • The data is large
  • There are enough training examples to avoid overfitting (since MLPs have many parameters)

Because of the last point, although we will train MLPs on both the Smart-seq and Drop-seq data sets, significant results are only expected on Drop-seq due to the much larger size compared to Smart-seq. The models trained on Smart-seq are simply for comparison purposes later on.

Key Hyperparameters¶

  • hidden_layer_sizes: Tuple specifying the number of neurons in each hidden layer.

    • Example: (100, 50) means two hidden layers, the first with 100 neurons and the second with 50.
  • alpha: L2 regularization parameter that penalizes large weights to prevent overfitting.

    • A larger $\alpha$ increases regularization, encouraging the model to use smaller weights.

The regularized loss function becomes:

$$ L_{\text{total}} = L_{\text{data}} + \alpha \sum_i ||W^{(i)}||^2 $$

Training¶

In [ ]:
def train_mlp(
    X_train,
    y_train,
    random_state: int | None = None,
    n_jobs: int | None = None,
    verbose: bool = True
):
    if verbose:
        print("========================= Training =========================")
    
    params = {
        "hidden_layer_sizes": [(200,), (100, 50), (100, 100), (200, 100, 50)],
        "alpha": [1e-4, 1e-3, 1e-2, 1e-1],  # L2 regularization strength
    }
    
    model = GridSearchCV(
        estimator = MLPClassifier(
            max_iter = 500,
            random_state = random_state,
            early_stopping = True,
            n_iter_no_change = 10,
            verbose = False,
        ),
        param_grid = params,
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    )
    
    model.fit(X_train, y_train)

    if verbose:
        summarize_crossvalidation(model)
        print("Training accuracy:", model.score(X_train, y_train))
        
        plot_learning_curve(model, list(params.keys()), log_scale_params=["alpha"])
    
    return model.best_estimator_

Evaluation¶

In [ ]:
def train_test_mlp(
    X,
    y,
    test_size: float = 0.25,
    train_size: float | None = None,
    random_state: int = 10,
    n_jobs: int | None = None,
    verbose: bool = True
):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y
    )
    
    if verbose:
        print("Training data dimensions:", X_train.shape)
        print("Testing data dimensions:", X_test.shape)
    
    model = train_mlp(X_train = X_train, y_train = y_train, random_state = random_state, n_jobs = n_jobs, verbose = verbose)
    accuracy = test_model(model = model, X_test = X_test, y_test = y_test, verbose = verbose)
    
    return TrainedModelWrapper(
        model = model,
        X = X,
        y = y,
        X_train = X_train,
        y_train = y_train,
        X_test = X_test,
        y_test = y_test,
        accuracy = accuracy
    )
In [ ]:
ss_mcf7_pca_mlp = train_test_mlp(X_pca_ss_mcf7, y_pca_ss_mcf7, n_jobs = -1)
Training data dimensions: (187, 20)
Testing data dimensions: (63, 20)
========================= Training =========================
Best Parameters: {'alpha': 0.0001, 'hidden_layer_sizes': (100, 100)}
Best Score (CV avg): 0.9677098150782362
Max Iterations: 500
Number of iterations for convergence: 14
Training accuracy: 0.9893048128342246
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              31               0
Actual Norm               0              32
Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      1.00      1.00        31
        Norm       1.00      1.00      1.00        32

    accuracy                           1.00        63
   macro avg       1.00      1.00      1.00        63
weighted avg       1.00      1.00      1.00        63

In [ ]:
ss_hcc_pca_mlp = train_test_mlp(X_pca_ss_hcc, y_pca_ss_hcc, n_jobs = -1)
Training data dimensions: (136, 34)
Testing data dimensions: (46, 34)
========================= Training =========================
Best Parameters: {'alpha': 0.0001, 'hidden_layer_sizes': (100, 100)}
Best Score (CV avg): 0.8968253968253969
Max Iterations: 500
Number of iterations for convergence: 22
Training accuracy: 0.9705882352941176
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              24               1
Actual Norm               1              20
Accuracy: 0.9565217391304348
Classification report:
               precision    recall  f1-score   support

        Hypo       0.96      0.96      0.96        25
        Norm       0.95      0.95      0.95        21

    accuracy                           0.96        46
   macro avg       0.96      0.96      0.96        46
weighted avg       0.96      0.96      0.96        46

As expected, the models trained on SmartSeq do not perform particularly well.

In [ ]:
ds_mcf7_pca_mlp = train_test_mlp(X_pca_ds_mcf7, y_pca_ds_mcf7, n_jobs = -1)
Training data dimensions: (16219, 761)
Testing data dimensions: (5407, 761)
========================= Training =========================
Best Parameters: {'alpha': 0.1, 'hidden_layer_sizes': (200,)}
Best Score (CV avg): 0.9808249428818134
Max Iterations: 500
Number of iterations for convergence: 13
Training accuracy: 0.990381651149886
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2157              73
Actual Norm              28            3149
Accuracy: 0.9813205104494174
Classification report:
               precision    recall  f1-score   support

        Hypo       0.99      0.97      0.98      2230
        Norm       0.98      0.99      0.98      3177

    accuracy                           0.98      5407
   macro avg       0.98      0.98      0.98      5407
weighted avg       0.98      0.98      0.98      5407

In [ ]:
ds_hcc_pca_mlp = train_test_mlp(X_pca_ds_hcc, y_pca_ds_hcc, n_jobs = -1)
Training data dimensions: (11011, 844)
Testing data dimensions: (3671, 844)
========================= Training =========================
Best Parameters: {'alpha': 0.1, 'hidden_layer_sizes': (100, 100)}
Best Score (CV avg): 0.9560442102112428
Max Iterations: 500
Number of iterations for convergence: 39
Training accuracy: 0.9964580873671782
No description has been provided for this image
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2119             106
Actual Norm              61            1385
Accuracy: 0.954508308362844
Classification report:
               precision    recall  f1-score   support

        Hypo       0.97      0.95      0.96      2225
        Norm       0.93      0.96      0.94      1446

    accuracy                           0.95      3671
   macro avg       0.95      0.96      0.95      3671
weighted avg       0.96      0.95      0.95      3671

Feature selection¶

Feature selection methods can be employed to reduce the number of dimensions without transforming the data (i.e. instead of PCA), thus maintaining the interpretability of each gene before training a model on the data. Furthermore, feature selection can help reduce noise and improve the generalizability of the model. They can also be used in conjunction with PCA to identify the principle components which are more important for classification.

In [ ]:
X_ss_mcf7 = ss_mcf7_norm.T.iloc[:]
y_ss_mcf7 = ["Hypo" if "hypo" in name.lower() else "Norm" for name in ss_mcf7_norm.columns]

X_ss_hcc = ss_hcc_norm.T.iloc[:]
y_ss_hcc = ["Hypo" if "hypo" in name.lower() else "Norm" for name in ss_hcc_norm.columns]

X_ds_mcf7 = ds_mcf7_norm.T.iloc[:]
y_ds_mcf7 = ["Hypo" if "hypo" in name.lower() else "Norm" for name in ds_mcf7_norm.columns]

X_ds_hcc = ds_hcc_norm.T.iloc[:]
y_ds_hcc = ["Hypo" if "hypo" in name.lower() else "Norm" for name in ds_hcc_norm.columns]

Feature selection functions¶

In [ ]:
def get_selected_features(
    pipeline: Pipeline,
    X_train,
    step_names: list[str]
) -> list[str]:
    feature_names = X_train.columns
    
    for name in step_names:
        selector = pipeline.named_steps[name]
        mask = selector.get_support()
        feature_names = feature_names[mask]
        
    return feature_names.to_list()
In [ ]:
def get_selected_pcs_from_model(estimator: BaseEstimator, verbose: bool = True):
    selector = SelectFromModel(estimator, prefit = True)
    mask = selector.get_support()
    pcs = [i + 1 for i in range(len(mask)) if mask[i]]
    
    if verbose:
        print(f"Top {len(pcs)} principal components:")
        print(pcs)
    
    return pcs
In [ ]:
def count_and_sort_occurrences(feature_lists: list[list[str]], verbose: bool = True):
    top_features = []
    for feature_list in feature_lists:
        top_features += feature_list
    
    top_features = np.array(top_features)
    unique_features, feature_counts = np.unique(top_features, return_counts = True)
    
    top_features = np.asarray((unique_features, feature_counts)).T
    top_features = top_features[top_features[:, 1].argsort()][::-1]
    
    if verbose:
        print("Feature | Occurrences")
        print(top_features)
    
    return top_features
In [ ]:
def filter_by_occurrences(feature_list: np.ndarray, n_occurrences: int):
    return [feature[0].item() for feature in feature_list if int(feature[1]) == n_occurrences]

Recursive feature elimination (RFE) with grid-search cross-validation provides comprehensive feature selection with a more compact and interpretable set of genes. However, this comprehensive feature selection is very computationally intensive. To reduce the training time, the SelectKBest selector uses ANOVA to select the k best features. This set of features is further reduced by training (linear) SVM on the data and identifying the most important features in the model. This SVM-based selection works well for linear models like SVM and logistic regression.

In [ ]:
def train_feature_selection_svm_rfe(
    X_train,
    y_train,
    estimator: BaseEstimator,
    estimator_params: dict[str, list],
    k: int | None = 1_000,
    random_state: int | None = None,
    n_jobs: int | None = None,
    verbose: bool = True
):
    """Train a feature selection pipeline using ANOVA, SVM, and RFE.

    Returns:
        tuple[Pipeline, list[str]]: Best pipeline and list of selected features.
    """
    if verbose:
        print("========================= Training =========================")
    
    n_samples = X_train.shape[0]
    params = {f"estimator__{param}": options for param, options in estimator_params.items()}
    
    if hasattr(estimator, "random_state") and random_state is not None:
        estimator.set_params(random_state = random_state)
        
    if hasattr(estimator, "n_jobs") and n_jobs is not None:
        estimator.set_params(n_jobs = n_jobs)
        
    univariate_selector = SelectKBest(k = k)
    svm_selector = SelectFromModel(LinearSVC(C = 0.025, random_state = random_state, max_iter = 10_000))
    rfe_selector = RFECV(estimator)
    
    pipeline = Pipeline([
        ("univariate", univariate_selector),
        ("svm", svm_selector),
        ("rfe", rfe_selector),
        ("estimator", estimator)
    ])
    
    pipeline = GridSearchCV(
        estimator = pipeline,
        param_grid = params,
        refit = True,
        scoring = "f1_macro",
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    ) if n_samples < 10_000 else RandomizedSearchCV(
        estimator = pipeline,
        param_distributions = params,
        random_state = random_state,
        refit = True,
        scoring = "f1_macro",
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True,
        
    )
    pipeline.fit(X_train, y_train)
    summarize_crossvalidation(pipeline)
    
    print("Training accuracy:", pipeline.score(X_train, y_train))
    
    plot_learning_curve(pipeline, list(params.keys()))
    
    best_pipeline: Pipeline = pipeline.best_estimator_
    selected_features = get_selected_features(best_pipeline, X_train, ["univariate", "svm", "rfe"])
    
    if verbose:
        print("Number of selected genes:", len(selected_features))
        print("Selected genes:", selected_features)
    
    return best_pipeline, selected_features
In [ ]:
def feature_selection_svm_rfe(
    X,
    y,
    estimator: BaseEstimator,
    estimator_params: dict[str, list],
    test_size: float = 0.25,
    train_size: float | None = None,
    random_state: int | None = 10,
    n_jobs: int | None = None,
    verbose: bool = True
):
    """Train and test a feature selection pipeline using ANOVA, SVM, and RFE.

    Returns:
        tuple[Pipeline, list[str], float]: Best pipeline, list of selected features, test accuracy.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y)
    
    if verbose:
        print("Training data dimensions:", X_train.shape)
        print("Testing data dimensions:", X_test.shape)
    
    pipeline, selected_features = train_feature_selection_svm_rfe(
        X_train = X_train,
        y_train = y_train,
        estimator = estimator,
        estimator_params = estimator_params,
        random_state = random_state,
        n_jobs = n_jobs,
        verbose = verbose
    )
    
    accuracy = test_model(pipeline, X_test, y_test, verbose)
    
    return pipeline, selected_features, accuracy

Since random forest is not a linear model like SVM and logistic regression, an SVM pre-selector doesn't align well with the model. Random forest captures non-linear and interaction effects which may not be picked up on since strong linear variables get selected.

In [ ]:
def train_feature_selection_random_forest(
    X_train,
    y_train,
    estimator: BaseEstimator,
    estimator_params: dict[str, list],
    k: int | None = 500,
    random_state: int | None = None,
    n_jobs: int | None = None,
    verbose: bool = True
):
    """Train a feature selection pipeline using ANOVA and random forest.

    Returns:
        tuple[Pipeline, list[str]]: Best pipeline and list of selected features.
    """
    if verbose:
        print("========================= Training =========================")
    
    n_samples = X_train.shape[0]
    params = {f"estimator__{param}": options for param, options in estimator_params.items()}
    
    if hasattr(estimator, "random_state") and random_state is not None:
        estimator.set_params(random_state = random_state)
        
    if hasattr(estimator, "n_jobs") and n_jobs is not None:
        estimator.set_params(n_jobs = n_jobs)
        
    univariate_selector = SelectKBest(k = k)
    random_forest_selector = SelectFromModel(RandomForestClassifier())
    
    pipeline = Pipeline([
        ("univariate", univariate_selector),
        ("random_forest", random_forest_selector),
        ("estimator", estimator)
    ])
    
    pipeline = GridSearchCV(
        estimator = pipeline,
        param_grid = params,
        refit = True,
        scoring = "f1_macro",
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    ) if n_samples < 10_000 else RandomizedSearchCV(
        estimator = pipeline,
        param_distributions = params,
        random_state = random_state,
        refit = True,
        scoring = "f1_macro",
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True,
    )
    pipeline.fit(X_train, y_train)
    summarize_crossvalidation(pipeline)
    
    print("Training accuracy:", pipeline.score(X_train, y_train))
    
    plot_learning_curve(pipeline, list(params.keys()))
    
    best_pipeline: Pipeline = pipeline.best_estimator_
    selected_features = get_selected_features(best_pipeline, X_train, ["univariate", "random_forest"])
    
    if verbose:
        print("Number of selected genes:", len(selected_features))
        print("Selected genes:", selected_features)
    
    return best_pipeline, selected_features
In [ ]:
def feature_selection_random_forest(
    X,
    y,
    estimator: BaseEstimator,
    estimator_params: dict[str, list],
    test_size: float = 0.25,
    train_size: float | None = None,
    random_state: int | None = 10,
    n_jobs: int | None = None,
    verbose: bool = True
):
    """Train and test a feature selection pipeline using ANOVA, SVM, and RFE.

    Returns:
        tuple[Pipeline, list[str], float]: Best pipeline, list of selected features, test accuracy.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y)
    
    if verbose:
        print("Training data dimensions:", X_train.shape)
        print("Testing data dimensions:", X_test.shape)
    
    pipeline, selected_features = train_feature_selection_random_forest(
        X_train = X_train,
        y_train = y_train,
        estimator = estimator,
        estimator_params = estimator_params,
        random_state = random_state,
        n_jobs = n_jobs,
        verbose = verbose
    )
    
    accuracy = test_model(pipeline, X_test, y_test, verbose)
    
    return pipeline, selected_features, accuracy

Logistic regression¶

Use the feature selection pipeline to select genes from the raw data.

In [ ]:
ss_mcf7_logit, ss_mcf7_logit_features, ss_mcf7_logit_accuracy = feature_selection_svm_rfe(
    X = X_ss_mcf7,
    y = y_ss_mcf7,
    estimator = LogisticRegression(max_iter = 10_000),
    estimator_params = {"C": [0.1, 1, 2]},
    n_jobs = -1
)
Training data dimensions: (187, 3000)
Testing data dimensions: (63, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 1}
Best Score (CV avg): 0.9893277893277894
Training accuracy: 1.0
No description has been provided for this image
Number of selected genes: 24
Selected genes: ['CYP1B1', 'DDIT4', 'TUBA1B', 'GFRA1', 'MT-CYB', 'SLC9A3R1', 'XBP1', 'MT-CO3', 'EMP2', 'MT-CO2', 'SLC39A6', 'PGK1', 'LDHA', 'STARD10', 'MT-CO1', 'SCD', 'FLNA', 'MT-ATP6', 'DHCR7', 'SULF2', 'GATA3', 'DDX5', 'NME1-NME2', 'ALDOA']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              31               0
Actual Norm               0              32
Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      1.00      1.00        31
        Norm       1.00      1.00      1.00        32

    accuracy                           1.00        63
   macro avg       1.00      1.00      1.00        63
weighted avg       1.00      1.00      1.00        63

In [ ]:
ss_hcc_logit, ss_hcc_logit_features, ss_hcc_logit_accuracy = feature_selection_svm_rfe(
    X = X_ss_hcc,
    y = y_ss_hcc,
    estimator = LogisticRegression(max_iter = 10_000),
    estimator_params = {"C": [0.1, 1, 2]},
    n_jobs = -1
)
Training data dimensions: (136, 3000)
Testing data dimensions: (46, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.1}
Best Score (CV avg): 0.9775925925925926
Training accuracy: 1.0
No description has been provided for this image
Number of selected genes: 166
Selected genes: ['DDIT4', 'ANGPTL4', 'CCNB1', 'IGFBP3', 'AKR1C2', 'NDRG1', 'KRT4', 'FN1', 'MMP1', 'SPP1', 'EGLN3', 'CA9', 'CDC20', 'AURKA', 'PLIN2', 'UPK1B', 'AKR1C1', 'FOS', 'LAMB3', 'LY6D', 'H4C3', 'AKR1C3', 'TPX2', 'PLAU', 'CXCL1', 'FAM83A', 'BNIP3', 'INSIG1', 'KRT19', 'BHLHE40', 'TXNIP', 'THBS1', 'ALDOC', 'ID3', 'CEACAM5', 'FTH1', 'GPRC5A', 'CCNB2', 'KPNA2', 'FTL', 'PLK2', 'DKK1', 'KCTD11', 'SLC2A1', 'CLDN4', 'KIF23', 'PGK1', 'SLC6A8', 'KIF2C', 'LOXL2', 'CHAC1', 'SPAG5', 'F3', 'WTAPP1', 'CSTB', 'HSPA5', 'DHCR7', 'HERPUD1', 'FGFBP1', 'CDKN1A', 'PFKFB3', 'DHRS3', 'LDHA', 'SLCO4A1', 'KDM5B', 'KRT8', 'PRC1', 'ADM', 'KNSTRN', 'FDFT1', 'CKS2', 'TMSB10', 'SLC38A2', 'CD44', 'FOSL2', 'JUP', 'KYNU', 'ALDH1A3', 'S100A2', 'KRT18', 'ZWINT', 'PRSS23', 'HBP1', 'SQSTM1', 'MYC', 'JUNB', 'H1-0', 'C10orf55', 'MSMO1', 'ERO1A', 'SRXN1', 'CKAP2', 'TFRC', 'SEMA4B', 'ITGA6', 'EIF5', 'P4HA1', 'TRIM29', 'SLC20A1', 'TRIM16', 'CDC6', 'IRF6', 'HMGCS1', 'GPX2', 'GPI', 'HSPA8', 'ISG15', 'ALDOA', 'CAV1', 'BIRC5', 'TXN', 'TUBB', 'PCDH1', 'TUBB4B', 'MT-CO3', 'ACAT2', 'POLR2A', 'IER2', 'AMOTL2', 'FSCN1', 'MT-CYB', 'BLCAP', 'PLOD2', 'TUBA1B', 'HES1', 'NQO1', 'DCBLD2', 'HSP90B1', 'FYB1', 'UGDH', 'LMNA', 'MRNIP', 'HRH1', 'PCNA', 'PRNP', 'BNIP3L', 'TPBG', 'C4orf3', 'MT-RNR1', 'EGLN1', 'PRDX1', 'UBB', 'HMGA1', 'PSMD2', 'NUP188', 'HSP90AA1', 'NRP1', 'SRM', 'HSPH1', 'BAG3', 'MIF-AS1', 'MIF', 'HLA-A', 'LDHB', 'CDK2AP2', 'PERP', 'EIF4A2', 'PPIF', 'FUT11', 'FAM162A', 'TYSND1', 'CLDN7', 'P4HA2', 'CANX', 'NOLC1', 'VCPIP1']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              24               1
Actual Norm               0              21
Accuracy: 0.9782608695652174
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      0.96      0.98        25
        Norm       0.95      1.00      0.98        21

    accuracy                           0.98        46
   macro avg       0.98      0.98      0.98        46
weighted avg       0.98      0.98      0.98        46

In [ ]:
ds_mcf7_logit, ds_mcf7_logit_features, ds_mcf7_logit_accuracy = feature_selection_svm_rfe(
    X = X_ds_mcf7,
    y = y_ds_mcf7,
    estimator = LogisticRegression(max_iter = 10_000),
    estimator_params = {"C": [0.1, 1, 2]},
    n_jobs = -1
)
Training data dimensions: (16219, 3000)
Testing data dimensions: (5407, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.1}
Best Score (CV avg): 0.9732560929550514
Training accuracy: 0.9826839329314294
No description has been provided for this image
Number of selected genes: 377
Selected genes: ['MT-RNR2', 'TFF1', 'MT-RNR1', 'GDF15', 'MT-CO3', 'MT-ND4', 'MT-ND3', 'MT-CYB', 'IGFBP5', 'TMSB10', 'MT-ATP6', 'MT-CO2', 'MT-TS1', 'MT-ND6', 'MT-ND2', 'MTND1P23', 'MT-ND4L', 'MT-ND5', 'MT-TN', 'MT-ND1', 'MT-TA', 'MT-TQ', 'HES1', 'MT-TM', 'LGALS1', 'TMEM64', 'MTND2P28', 'MT-TE', 'H19', 'MT-ATP8', 'H2AC12', 'NCDN', 'MT-TY', 'TOB1', 'H2AC20', 'MT-TP', 'ANKRD52', 'MT-TD', 'C16orf91', 'ATN1', 'WSB2', 'GPM6A', 'ZFP36', 'VMP1', 'TFF3', 'KMT2D', 'FGF23', 'CRTC2', 'CSK', 'PLBD2', 'ITPK1', 'PLEC', 'GOLGA4', 'PLCD3', 'PTP4A2', 'TIAM1', 'SOX4', 'BTBD9', 'H2AC11', 'CBFA2T3', 'PROSER1', 'ARF3', 'PARD6B', 'RPL13', 'TPI1', 'BTN3A2', 'GREM1', 'RNF146', 'S100A10', 'CHAC2', 'ATXN2L', 'TGFB3', 'MGRN1', 'CAPZA1', 'FAM189B', 'GSE1', 'CERS2', 'ENO1', 'SLC48A1', 'PKIB', 'RHOD', 'BLOC1S3', 'KRT19', 'RPL34', 'TCF20', 'LINC01291', 'FAM102A', 'PRRG3', 'GABPB2', 'CAMK2N1', 'VPS9D1-AS1', 'TAF13', 'INCENP', 'ZNRF1', 'NINJ1', 'ZBTB34', 'DSP', 'ZNF480', 'CALHM2', 'MSMB', 'KCNJ2', 'ZBTB20', 'TPD52L1', 'HSPH1', 'HMGA1', 'CASP8AP2', 'ZNF302', 'ELOA', 'GPATCH4', 'SNX24', 'DVL3', 'SNX27', 'YTHDF3', 'GAB2', 'PACS1', 'NLK', 'THAP1', 'KCNJ3', 'LDLRAP1', 'TRAK2', 'CAMSAP2', 'PPM1G', 'NCALD', 'LRRFIP2', 'DNAJA1', 'SMKR1', 'MAPKAPK2', 'ZNF702P', 'NACC1', 'TRIM37', 'RFK', 'FBXL16', 'TCHP', 'ISCU', 'RABEP1', 'CACNG4', 'RPSAP48', 'WWC3', 'GDAP2', 'SRCAP', 'USP32', 'FLOT2', 'MAFF', 'NCOA1', 'TWNK', 'AKAP5', 'NEDD4L', 'APOOL', 'CCDC18', 'RAB27A', 'BRPF3', 'BCAS3', 'GATAD2A', 'NSD1', 'NPM1P40', 'ANKRD40', 'ILRUN', 'PSMD14', 'STRBP', 'TPM1', 'CAV1', 'MPHOSPH9', 'ANXA6', 'PRXL2C', 'CDC25B', 'KIF14', 'PYGO2', 'ZNF688', 'KHSRP', 'BAP1', 'MDM2', 'RAB5C', 'PAQR8', 'SOS1', 'KRT80', 'SECISBP2L', 'BOLA3', 'DNAJA4', 'THRB', 'ARPP19', 'S100A11', 'FRS2', 'RGPD4-AS1', 'BRIP1', 'PRR12', 'TEDC2-AS1', 'RPL15', 'DKC1', 'C9orf78', 'NBEAL2', 'SETD3', 'FEM1A', 'SLC25A24', 'ARMC6', 'SLC13A5', 'CFAP97', 'NEDD1', 'PHLDA2', 'MARK3', 'SPATS2L', 'PAPOLA', 'MT2A', 'ZNF354A', 'SET', 'ATXN1L', 'SCYL2', 'ZNF703', 'SRFBP1', 'UBA52', 'MGLL', 'LAD1', 'ZC3H15', 'SLC25A48', 'RAD23A', 'EIF4G2', 'HOXC13', 'PITPNA', 'TAF9B', 'LXN', 'SERINC5', 'FBRS', 'SMC5', 'RAI14', 'TRIM44', 'MYO5C', 'AKT1S1', 'TBKBP1', 'EIF2B4', 'PRR34-AS1', 'PSME4', 'PDAP1', 'ARHGAP26', 'ELP3', 'SENP6', 'DNAJC21', 'FAM104A', 'CS', 'ABL1', 'EIF3A', 'H2AX', 'MARK2', 'LCLAT1', 'S100P', 'RCC1L', 'ANKRD17', 'TMEM259', 'RAB1B', 'GAPDH', 'TMEM258', 'SSX2IP', 'PDS5A', 'FAM177A1', 'NAA10', 'CNOT9', 'PGK1', 'PKM', 'KLHL8', 'BCL3', 'PRMT6', 'CACNA1A', 'GOLGA3', 'SOCS2', 'PPP1R12B', 'DCTN1', 'C7orf50', 'ZMIZ1', 'PGAM5', 'RPL30', 'ARNTL2', 'PREX1', 'LYAR', 'PRRC2C', 'PCYT1A', 'GLE1', 'ZFC3H1', 'BMPR1B', 'RBBP6', 'ZNF764', 'RAB35', 'ENOX2', 'LMNB2', 'ZNF326', 'ARID1B', 'TIMELESS', 'PFDN4', 'LPP', 'SYNE2', 'ZRANB1', 'PLCB4', 'CBX3', 'NOL4L', 'SPRY1', 'RPS6KA6', 'CKS2', 'SMC6', 'AURKA', 'BICDL1', 'DBNDD1', 'CRNDE', 'C2orf49', 'TPM4', 'FAM111B', 'KPNA2', 'NCKAP1', 'INF2', 'CSNK2A2', 'FARP1', 'MAP3K13', 'NCOA5', 'DNAJA3', 'RRP1B', 'HCFC1', 'ACTB', 'GATA3', 'DHX38', 'TSPYL1', 'TBC1D9', 'IWS1', 'FAM50A', 'AFF1', 'WDR43', 'SHISA5', 'CLTB', 'ETF1', 'RSRC2', 'GNAQ', 'BAZ2A', 'TARS1', 'KCNQ1OT1', 'YWHAB', 'EBAG9', 'PITX1', 'KPNA4', 'UTP18', 'PSMD5', 'NFIC', 'PHF20L1', 'PATL1', 'POLB', 'TNIP2', 'ARIH1', 'KLC1', 'ZBTB7A', 'NPLOC4', 'ARFGEF1', 'TRAF3IP2', 'BMS1', 'INPP4B', 'MYH14', 'KITLG', 'ATF5', 'TBCA', 'PICALM', 'FAM13B', 'FBXL18', 'MYO10', 'TAOK3', 'BBOF1', 'CLSPN', 'PAK2', 'STRIP1', 'IFI27L2', 'LTBR', 'ESRP2', 'C6orf62', 'AAMP', 'PMEPA1', 'UBE2Q2', 'DHX37', 'SLAIN2', 'OTUD7B', 'RPL23', 'NCBP3', 'ATRX', 'CCM2', 'NOM1', 'SMIM27']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2141              89
Actual Norm              54            3123
Accuracy: 0.9735528019234326
Classification report:
               precision    recall  f1-score   support

        Hypo       0.98      0.96      0.97      2230
        Norm       0.97      0.98      0.98      3177

    accuracy                           0.97      5407
   macro avg       0.97      0.97      0.97      5407
weighted avg       0.97      0.97      0.97      5407

In [ ]:
ds_hcc_logit, ds_hcc_logit_features, ds_hcc_logit_accuracy = feature_selection_svm_rfe(
    X = X_ds_hcc,
    y = y_ds_hcc,
    estimator = LogisticRegression(max_iter = 10_000),
    estimator_params = {"C": [0.1, 1, 2]},
    n_jobs = -1
)
Training data dimensions: (11011, 3000)
Testing data dimensions: (3671, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.1}
Best Score (CV avg): 0.9421202743207646
Training accuracy: 0.9611172478461597
No description has been provided for this image
Number of selected genes: 399
Selected genes: ['BCYRN1', 'IGFBP3', 'H2AC11', 'RAPGEF3', 'DDIT4', 'MB', 'MT-TV', 'ADAP1', 'MT-TL1', 'NDRG1', 'MIR210HG', 'MT-TQ', 'SPN', 'ZNF263', 'MT-CO3', 'CACHD1', 'BTBD9', 'GDPGP1', 'ARTN', 'LINC01304', 'CACNB2', 'HELQ', 'NAXD', 'GPM6A', 'MT-TS1', 'MT-TA', 'CRIP2', 'NEAT1', 'DUSP9', 'PRR5L', 'USP35', 'MT-ND1', 'MT-CYB', 'H19', 'H2AC12', 'CNR2', 'FGF23', 'DANT1', 'AKR1C2', 'TMSB10', 'EHBP1L1', 'LDHA', 'MT-ND6', 'EFNA2', 'CITED2', 'H4C5', 'CNOT6L', 'CLDN4', 'CPNE2', 'DTNB', 'GABRE', 'LINC02511', 'MT-ATP6', 'LGALS1', 'NUPR2', 'H2AC16', 'MPDU1', 'ATXN2L', 'PGAM1', 'PPIL1', 'NCL', 'COL6A3', 'ABO', 'RPL17', 'SLC2A1', 'C4orf3', 'NOP10', 'KLC2', 'MT-CO2', 'NCALD', 'EGLN3', 'OPTN', 'ANKRD9', 'TRAK1', 'CNNM2', 'RRAS', 'BNIP3', 'ENTR1', 'FGF8', 'B4GALT1', 'GPI', 'LIMCH1', 'MIR663AHG', 'SREK1IP1P1', 'FBXL17', 'LINC02541', 'P4HA1', 'H2BC4', 'RPL41', 'COX8A', 'MT-TS2', 'PROSER1', 'PGK1', 'CAMK2N1', 'HEPACAM', 'MSR1', 'GDI1', 'SIGMAR1', 'AHNAK2', 'GCAT', 'SINHCAFP3', 'CTXN1', 'LINC01133', 'POLR3GL', 'HES4', 'PDCD4', 'TNFRSF12A', 'ENKD1', 'SHOX', 'RGPD4-AS1', 'HIF3A', 'S100A10', 'APOOL', 'RTL8C', 'ARNTL', 'HMGB2', 'NEDD9', 'TMEM70', 'FASTKD5', 'DAAM1', 'HSP90AB1', 'ZBED2', 'EFNA5', 'PSMG1', 'TMSB4XP4', 'NPM1P40', 'RPL39', 'AJAP1', 'SAMD4A', 'WDR77', 'PAQR7', 'NDUFB4', 'BTN3A2', 'VIT', 'ARHGDIA', 'H3C2', 'FOSL2', 'MIXL1', 'MCM3AP', 'GJB3', 'PRRC2A', 'FSD1L', 'IVL', 'KCNJ3', 'BNIP3L', 'S100A11', 'BMPR1B', 'H2BC9', 'TNNT1', 'CEP120', 'LINC02367', 'RAB30', 'ZBED4', 'RAB11FIP4', 'RNF122', 'NEDD4L', 'RAB2B', 'RPS27', 'CSTB', 'C1orf53', 'NCK1', 'CPEB1', 'MLLT3', 'MELTF-AS1', 'TCF7L1', 'NT5C', 'MT1E', 'RPSAP48', 'TNFSF13B', 'ECH1', 'NDUFA8', 'MIOS-DT', 'KRT19', 'ZNF318', 'POLDIP2', 'VPS45', 'ZNF418', 'YTHDF3', 'MT-ND4L', 'PI4KB', 'ADARB1', 'AXL', 'CACNA1A', 'TUBB6', 'NRG4', 'NMD3', 'FAM126B', 'PHACTR1', 'TXNRD2', 'BAP1', 'HSPD1', 'PLD1', 'JAKMIP3', 'DDX23', 'RPL28', 'ANKEF1', 'RPS6KA6', 'DUSP5', 'SH3RF1', 'ARHGEF26', 'SLC6A8', 'JUN', 'OVOL1', 'APEH', 'CAVIN3', 'ZNF302', 'DCAKD', 'ARL2', 'LINC01902', 'RBSN', 'CREB1', 'TATDN2', 'PRRG3', 'RPS21', 'ALDOC', 'MMP2', 'POLE4', 'PTGR1', 'CCDC168', 'GBP1P1', 'TSHZ2', 'IRF2BPL', 'ADM', 'CAST', 'RPS29', 'AKR1C1', 'PCDHGA10', 'RGS10', 'TGDS', 'EPHX1', 'KAT7', 'NEUROD2', 'CFAP251', 'MXRA5', 'PFKFB3', 'PLOD2', 'PPTC7', 'ING2', 'CD47', 'ZNF33B', 'KIRREL1', 'KDM3A', 'UQCC2', 'FUT11', 'MXI1', 'MED18', 'SYNJ2', 'SNHG18', 'RNF25', 'AKT1S1', 'KLLN', 'NCAM1', 'RAB12', 'PDLIM1', 'MT1X', 'DERA', 'YTHDF1', 'AMFR', 'CEP83', 'SF3B4', 'POLR3A', 'PHRF1', 'GYS1', 'SRA1', 'EPPK1', 'SYT14', 'FAM162A', 'KCNJ2', 'ARMC6', 'MKNK1', 'HSP90AA1', 'INHBA', 'FYN', 'BTBD7P1', 'CENPB', 'RHBDD2', 'SNX22', 'SLC2A6', 'LINC01116', 'ISOC2', 'MPHOSPH6', 'JUND', 'RAB3GAP1', 'MNS1', 'DTYMK', 'TOLLIP', 'GIN1', 'FAH', 'GOLGA4', 'TMEM256', 'DGKD', 'WDR43', 'CAMSAP2', 'NACA4P', 'ARHGAP42', 'NDUFC1', 'GAPDH', 'TMEM238', 'GRK2', 'DNAH11', 'ZBTB2', 'TRIM44', 'CIAO2A', 'UTP3', 'CALM2', 'BRMS1', 'PCDHB1', 'TTL', 'FOSL1', 'YKT6', 'ACSL4', 'CCDC34', 'SAT2', 'RHOT2', 'MAD2L1', 'DBT', 'RPL27A', 'RPL37A', 'NUP93', 'AMOTL2', 'PPP4R2', 'CARM1', 'VEGFB', 'NCLN', 'MLLT6', 'MAP2K3', 'DNAAF5', 'PSMA7', 'DDX54', 'TCEAL9', 'RPLP0P2', 'KRT4', 'SNORD3B-1', 'FEM1A', 'TRIM52-AS1', 'MCM4', 'CCNG2', 'YWHAZ', 'ARID5B', 'MRPL55', 'KMT2D', 'SPG21', 'ZC3H15', 'EMP2', 'LETM1', 'EIF3J', 'SNRNP70', 'RHOD', 'MAFF', 'MAZ', 'UQCR11', 'PLCE1', 'CPTP', 'ARHGEF7', 'STMN1', 'ZNF202', 'SNHG9', 'HMGA1', 'CLIC1', 'ZHX1', 'TSPO', 'TPD52L1', 'FRY', 'DNMT3A', 'ARL13B', 'SMARCB1', 'RRS1', 'HEY1', 'SLC25A48', 'TMEM80', 'DYSF', 'MTA2', 'C19orf53', 'ARSA', 'DGKZ', 'VRK3', 'UIMC1', 'PSIP1', 'ZNF688', 'CMIP', 'PPIG', 'EXOC7', 'TAF15', 'MARCKS', 'AK4', 'KIF5B', 'ATP5F1E', 'IRAK1', 'BRAT1', 'TSR1', 'SART1', 'CAP1', 'SETD2', 'METTL26', 'STC2', 'DDIT3', 'KEAP1', 'DLD', 'CLIP2']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2115             110
Actual Norm              94            1352
Accuracy: 0.944429310814492
Classification report:
               precision    recall  f1-score   support

        Hypo       0.96      0.95      0.95      2225
        Norm       0.92      0.93      0.93      1446

    accuracy                           0.94      3671
   macro avg       0.94      0.94      0.94      3671
weighted avg       0.94      0.94      0.94      3671

Select the top principal components from the models trained on PCA encoded data.

In [ ]:
print("SmartSeq MCF7 Logistic Regression")
ss_mcf7_pca_logit_pcs = get_selected_pcs_from_model(ss_mcf7_pca_logit.model)
print()

print("SmartSeq HCC Logistic Regression")
ss_hcc_pca_logit_pcs = get_selected_pcs_from_model(ss_hcc_pca_logit.model)
print()

print("DropSeq MCF7 Logistic Regression")
ds_mcf7_pca_logit_pcs = get_selected_pcs_from_model(ds_mcf7_pca_logit.model)
print()

print("DropSeq HCC Logistic Regression")
ds_hcc_pca_logit_pcs = get_selected_pcs_from_model(ds_hcc_pca_logit.model)
print()
SmartSeq MCF7 Logistic Regression
Top 9 principal components:
[1, 3, 6, 8, 12, 15, 16, 17, 18]

SmartSeq HCC Logistic Regression
Top 10 principal components:
[2, 3, 9, 10, 12, 13, 16, 17, 23, 26]

DropSeq MCF7 Logistic Regression
Top 310 principal components:
[1, 2, 3, 5, 6, 8, 13, 14, 15, 16, 17, 18, 19, 20, 25, 26, 27, 28, 29, 30, 31, 32, 33, 36, 37, 40, 43, 44, 45, 46, 52, 54, 55, 56, 57, 59, 60, 61, 62, 65, 66, 69, 71, 74, 81, 82, 85, 87, 88, 91, 92, 94, 95, 96, 99, 100, 104, 105, 107, 110, 112, 114, 115, 116, 118, 119, 120, 121, 127, 128, 133, 135, 138, 140, 141, 142, 145, 146, 147, 149, 153, 157, 160, 161, 167, 170, 172, 173, 175, 177, 182, 186, 188, 190, 191, 193, 195, 197, 198, 200, 201, 203, 204, 205, 206, 211, 212, 213, 218, 219, 221, 230, 231, 232, 234, 235, 236, 239, 240, 241, 247, 249, 252, 253, 254, 258, 263, 264, 265, 267, 269, 271, 273, 275, 279, 281, 282, 286, 287, 291, 293, 302, 305, 312, 317, 318, 319, 322, 323, 327, 332, 337, 339, 341, 342, 344, 348, 350, 352, 353, 361, 364, 365, 370, 371, 375, 376, 377, 380, 381, 383, 385, 387, 389, 391, 392, 393, 398, 399, 400, 401, 402, 403, 406, 408, 409, 411, 415, 418, 419, 420, 426, 427, 429, 430, 431, 433, 434, 435, 436, 437, 438, 442, 446, 449, 455, 459, 460, 461, 462, 464, 466, 467, 469, 470, 471, 475, 481, 484, 485, 486, 487, 491, 494, 495, 496, 497, 499, 504, 506, 507, 508, 510, 512, 514, 517, 520, 522, 527, 534, 538, 540, 541, 543, 546, 552, 555, 556, 557, 564, 565, 576, 577, 580, 582, 585, 591, 592, 596, 597, 598, 599, 602, 606, 610, 612, 615, 621, 623, 624, 626, 631, 632, 633, 642, 646, 647, 650, 652, 653, 655, 658, 661, 666, 672, 674, 675, 677, 681, 682, 685, 696, 698, 700, 702, 718, 722, 724, 726, 732, 733, 734, 741, 742, 746, 751, 754, 755, 756, 758]

DropSeq HCC Logistic Regression
Top 320 principal components:
[2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 15, 16, 18, 19, 20, 21, 23, 24, 26, 27, 29, 30, 31, 32, 34, 36, 37, 38, 39, 41, 45, 46, 47, 48, 49, 53, 54, 55, 60, 63, 65, 69, 72, 76, 77, 80, 86, 88, 89, 90, 92, 94, 95, 96, 97, 99, 102, 103, 106, 115, 117, 118, 120, 121, 123, 124, 125, 126, 127, 131, 135, 136, 137, 139, 140, 141, 142, 143, 145, 147, 148, 151, 152, 153, 154, 155, 157, 159, 161, 162, 166, 167, 169, 170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 187, 189, 190, 191, 193, 197, 198, 199, 200, 201, 205, 207, 208, 210, 213, 215, 217, 218, 219, 220, 221, 224, 225, 227, 229, 230, 231, 233, 234, 235, 237, 238, 239, 240, 243, 245, 247, 249, 254, 255, 257, 259, 260, 261, 262, 263, 265, 266, 267, 270, 272, 275, 282, 290, 292, 294, 295, 297, 300, 301, 303, 308, 310, 312, 313, 315, 318, 319, 325, 326, 329, 332, 334, 336, 339, 341, 344, 345, 350, 351, 356, 357, 358, 359, 360, 362, 369, 371, 372, 373, 374, 375, 377, 379, 380, 383, 384, 391, 396, 399, 401, 403, 408, 409, 412, 413, 414, 415, 420, 423, 427, 438, 439, 444, 445, 447, 448, 450, 451, 453, 457, 461, 466, 474, 479, 490, 494, 499, 503, 507, 509, 515, 516, 521, 525, 528, 540, 550, 551, 561, 562, 563, 564, 566, 567, 575, 576, 577, 581, 582, 583, 584, 586, 594, 598, 600, 603, 604, 609, 610, 612, 613, 622, 630, 637, 640, 641, 642, 651, 653, 657, 672, 691, 692, 693, 698, 700, 701, 713, 717, 718, 730, 746, 747, 749, 756, 758, 760, 761, 763, 766, 769, 777, 781, 783, 784, 785, 789, 792, 793, 794, 799, 807, 808, 809, 810, 813, 814, 831, 835, 841, 842, 843]

SVM¶

Use the feature selection pipeline to select genes from the raw data.

In [ ]:
ss_mcf7_svm, ss_mcf7_svm_features, ss_mcf7_svm_accuracy = feature_selection_svm_rfe(
    X = X_ss_mcf7,
    y = y_ss_mcf7,
    estimator = LinearSVC(max_iter = 10_000),
    estimator_params = {"C": [0.025, 0.1, 1, 5]},
    n_jobs = -1
)
Training data dimensions: (187, 3000)
Testing data dimensions: (63, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.025}
Best Score (CV avg): 1.0
Training accuracy: 1.0
No description has been provided for this image
Number of selected genes: 33
Selected genes: ['DDIT4', 'NR4A1', 'FOS', 'STC2', 'HILPDA', 'MCM7', 'MT-CYB', 'TMEM64', 'XBP1', 'CRABP2', 'MT-CO3', 'EMP2', 'MT-CO2', 'PGK1', 'LDHA', 'STARD10', 'MT-CO1', 'DYNC2I2', 'FLNA', 'TMSB10', 'IFITM3', 'DSP', 'FAM162A', 'SULF2', 'QSOX1', 'ARPC1B', 'SYTL2', 'PSAP', 'CD9', 'HNRNPA2B1', 'GATA3', 'ATP9A', 'NME1-NME2']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              31               0
Actual Norm               0              32
Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      1.00      1.00        31
        Norm       1.00      1.00      1.00        32

    accuracy                           1.00        63
   macro avg       1.00      1.00      1.00        63
weighted avg       1.00      1.00      1.00        63

In [ ]:
ss_hcc_svm, ss_hcc_svm_features, ss_hcc_svm_accuracy = feature_selection_svm_rfe(
    X = X_ss_hcc,
    y = y_ss_hcc,
    estimator = LinearSVC(max_iter = 10_000),
    estimator_params = {"C": [0.025, 0.1, 1, 5]},
    n_jobs = -1
)
Training data dimensions: (136, 3000)
Testing data dimensions: (46, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.025}
Best Score (CV avg): 0.9850189600540231
Training accuracy: 1.0
No description has been provided for this image
Number of selected genes: 21
Selected genes: ['DDIT4', 'ANGPTL4', 'AKR1C2', 'MMP1', 'CDC20', 'AKR1C3', 'PLAU', 'DKK1', 'PGK1', 'HSPA5', 'LDHA', 'CD44', 'HSPA8', 'ALDOA', 'CAV1', 'TXN', 'MT-CYB', 'TUBA1B', 'NQO1', 'PRDX1', 'PSMD2']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              24               1
Actual Norm               0              21
Accuracy: 0.9782608695652174
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      0.96      0.98        25
        Norm       0.95      1.00      0.98        21

    accuracy                           0.98        46
   macro avg       0.98      0.98      0.98        46
weighted avg       0.98      0.98      0.98        46

In [ ]:
ds_mcf7_svm, ds_mcf7_svm_features, ds_mcf7_svm_accuracy = feature_selection_svm_rfe(
    X = X_ds_mcf7,
    y = y_ds_mcf7,
    estimator = LinearSVC(max_iter = 10_000),
    estimator_params = {"C": [0.1, 1, 5]},
    n_jobs = -1
)
Training data dimensions: (16219, 3000)
Testing data dimensions: (5407, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.1}
Best Score (CV avg): 0.9694665289633493
Training accuracy: 0.9858094450744297
No description has been provided for this image
Number of selected genes: 380
Selected genes: ['MT-RNR2', 'TFF1', 'MT-RNR1', 'GDF15', 'MT-CO3', 'MT-ND4', 'MT-ND3', 'MT-CYB', 'IGFBP5', 'TMSB10', 'MT-ATP6', 'MT-CO2', 'MT-TS1', 'MT-ND6', 'MT-ND2', 'MTND1P23', 'MT-ND4L', 'MT-ND5', 'MT-TN', 'MT-ND1', 'MT-TA', 'MT-TQ', 'HES1', 'MT-TM', 'LGALS1', 'TMEM64', 'MTND2P28', 'MT-TE', 'H19', 'MT-ATP8', 'H2AC12', 'NCDN', 'MT-TY', 'TOB1', 'H2AC20', 'MT-TP', 'ANKRD52', 'MT-TD', 'C16orf91', 'ATN1', 'WSB2', 'GPM6A', 'ZFP36', 'VMP1', 'TFF3', 'KMT2D', 'FGF23', 'CRTC2', 'CSK', 'PLBD2', 'ITPK1', 'PLEC', 'GOLGA4', 'PLCD3', 'PTP4A2', 'TIAM1', 'SOX4', 'BTBD9', 'H2AC11', 'CBFA2T3', 'PROSER1', 'ARF3', 'PARD6B', 'RPL13', 'TPI1', 'BTN3A2', 'GREM1', 'RNF146', 'S100A10', 'CHAC2', 'ATXN2L', 'TGFB3', 'MGRN1', 'CAPZA1', 'FAM189B', 'GSE1', 'CERS2', 'ENO1', 'SLC48A1', 'PKIB', 'RHOD', 'BLOC1S3', 'KRT19', 'RPL34', 'TCF20', 'LINC01291', 'FAM102A', 'PRRG3', 'GABPB2', 'CAMK2N1', 'VPS9D1-AS1', 'TAF13', 'INCENP', 'ZNRF1', 'NINJ1', 'ZBTB34', 'DSP', 'ZNF480', 'CALHM2', 'MSMB', 'SENP3', 'KCNJ2', 'ZBTB20', 'TPD52L1', 'HSPH1', 'HMGA1', 'CASP8AP2', 'ZNF302', 'ELOA', 'GPATCH4', 'SNX24', 'DVL3', 'SNX27', 'YTHDF3', 'GAB2', 'PACS1', 'NLK', 'THAP1', 'KCNJ3', 'LDLRAP1', 'TRAK2', 'CAMSAP2', 'PPM1G', 'HEPACAM', 'NCALD', 'LRRFIP2', 'DNAJA1', 'SMKR1', 'MAPKAPK2', 'ZNF702P', 'NACC1', 'TRIM37', 'RFK', 'FBXL16', 'TCHP', 'ISCU', 'RABEP1', 'CACNG4', 'RPSAP48', 'WWC3', 'GDAP2', 'SRCAP', 'USP32', 'FLOT2', 'MAFF', 'NCOA1', 'TWNK', 'AKAP5', 'NEDD4L', 'APOOL', 'CCDC18', 'RAB27A', 'BRPF3', 'BCAS3', 'GATAD2A', 'NSD1', 'NPM1P40', 'ANKRD40', 'ILRUN', 'PSMD14', 'STRBP', 'TPM1', 'CAV1', 'MPHOSPH9', 'ANXA6', 'PRXL2C', 'KIF14', 'PYGO2', 'ZNF688', 'KHSRP', 'BAP1', 'MDM2', 'RAB5C', 'PAQR8', 'SOS1', 'KRT80', 'SECISBP2L', 'BOLA3', 'DNAJA4', 'THRB', 'ARPP19', 'S100A11', 'FRS2', 'RGPD4-AS1', 'BRIP1', 'PRR12', 'TEDC2-AS1', 'RPL15', 'DKC1', 'C9orf78', 'NBEAL2', 'SETD3', 'FEM1A', 'SLC25A24', 'ARMC6', 'SLC13A5', 'CFAP97', 'NEDD1', 'PHLDA2', 'MARK3', 'SPATS2L', 'PAPOLA', 'MT2A', 'ZNF354A', 'SET', 'ATXN1L', 'SCYL2', 'ZNF703', 'SRFBP1', 'UBA52', 'MGLL', 'LAD1', 'ZC3H15', 'SLC25A48', 'RAD23A', 'EIF4G2', 'HOXC13', 'PITPNA', 'TAF9B', 'LXN', 'SERINC5', 'FBRS', 'SMC5', 'RAI14', 'TRIM44', 'MYO5C', 'AKT1S1', 'TBKBP1', 'EIF2B4', 'PRR34-AS1', 'PSME4', 'PDAP1', 'ARHGAP26', 'ELP3', 'SENP6', 'DNAJC21', 'FAM104A', 'CS', 'ABL1', 'EIF3A', 'MARK2', 'LCLAT1', 'S100P', 'RCC1L', 'ANKRD17', 'TMEM259', 'CPEB4', 'RAB1B', 'GAPDH', 'TMEM258', 'SSX2IP', 'PDS5A', 'FAM177A1', 'NAA10', 'CNOT9', 'PGK1', 'PKM', 'KLHL8', 'BCL3', 'PRMT6', 'CACNA1A', 'GOLGA3', 'SOCS2', 'HPCAL1', 'PPP1R12B', 'DCTN1', 'C7orf50', 'ZMIZ1', 'PGAM5', 'RPL30', 'ARNTL2', 'PREX1', 'LYAR', 'PRRC2C', 'PCYT1A', 'GLE1', 'ZFC3H1', 'BMPR1B', 'RBBP6', 'ZNF764', 'RAB35', 'ENOX2', 'LMNB2', 'ZNF326', 'ARID1B', 'TIMELESS', 'PFDN4', 'LPP', 'SYNE2', 'ZRANB1', 'PLCB4', 'CBX3', 'NOL4L', 'SPRY1', 'RPS6KA6', 'CKS2', 'SMC6', 'AURKA', 'BICDL1', 'DBNDD1', 'CRNDE', 'C2orf49', 'TPM4', 'FAM111B', 'KPNA2', 'NCKAP1', 'INF2', 'CSNK2A2', 'FARP1', 'FGD5-AS1', 'MAP3K13', 'NCOA5', 'DNAJA3', 'RRP1B', 'HCFC1', 'ACTB', 'GATA3', 'DHX38', 'TSPYL1', 'TBC1D9', 'IWS1', 'FAM50A', 'AFF1', 'WDR43', 'SHISA5', 'CLTB', 'ETF1', 'RSRC2', 'GNAQ', 'BAZ2A', 'TARS1', 'KCNQ1OT1', 'YWHAB', 'EBAG9', 'PITX1', 'KPNA4', 'UTP18', 'PSMD5', 'NFIC', 'PHF20L1', 'PATL1', 'POLB', 'TNIP2', 'ARIH1', 'KLC1', 'ZBTB7A', 'NPLOC4', 'ARFGEF1', 'TRAF3IP2', 'BMS1', 'INPP4B', 'MYH14', 'KITLG', 'ATF5', 'TBCA', 'PICALM', 'FAM13B', 'FBXL18', 'MYO10', 'TAOK3', 'BBOF1', 'CLSPN', 'PAK2', 'STRIP1', 'IFI27L2', 'LTBR', 'BEND7', 'ESRP2', 'C6orf62', 'AAMP', 'PMEPA1', 'UBE2Q2', 'DHX37', 'SLAIN2', 'OTUD7B', 'RPL23', 'NCBP3', 'ATRX', 'NOM1', 'SMIM27']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2140              90
Actual Norm              73            3104
Accuracy: 0.969853893101535
Classification report:
               precision    recall  f1-score   support

        Hypo       0.97      0.96      0.96      2230
        Norm       0.97      0.98      0.97      3177

    accuracy                           0.97      5407
   macro avg       0.97      0.97      0.97      5407
weighted avg       0.97      0.97      0.97      5407

In [ ]:
ds_hcc_svm, ds_hcc_svm_features, ds_hcc_svm_accuracy = feature_selection_svm_rfe(
    X = X_ds_hcc,
    y = y_ds_hcc,
    estimator = LinearSVC(max_iter = 10_000),
    estimator_params = {"C": [0.025, 0.1, 1, 5]},
    n_jobs = -1
)
Training data dimensions: (11011, 3000)
Testing data dimensions: (3671, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.025}
Best Score (CV avg): 0.9404403100550717
Training accuracy: 0.9652093585427515
No description has been provided for this image
Number of selected genes: 402
Selected genes: ['BCYRN1', 'IGFBP3', 'H2AC11', 'RAPGEF3', 'DDIT4', 'MB', 'MT-TV', 'MT-TL1', 'NDRG1', 'MIR210HG', 'MT-TQ', 'SPN', 'ZNF263', 'MT-CO3', 'CACHD1', 'BTBD9', 'GDPGP1', 'ARTN', 'LINC01304', 'CACNB2', 'HELQ', 'NAXD', 'GPM6A', 'MT-TS1', 'MT-TA', 'CRIP2', 'NEAT1', 'DUSP9', 'PRR5L', 'USP35', 'MT-ND1', 'MT-CYB', 'PHC1', 'H19', 'H2AC12', 'CNR2', 'FGF23', 'MUL1', 'DANT1', 'AKR1C2', 'TMSB10', 'EHBP1L1', 'LDHA', 'MT-ND6', 'EFNA2', 'CITED2', 'H4C5', 'CNOT6L', 'CLDN4', 'CPNE2', 'DTNB', 'GABRE', 'LINC02511', 'MT-ATP6', 'LGALS1', 'NUPR2', 'H2AC16', 'MPDU1', 'ATXN2L', 'PGAM1', 'PPIL1', 'NCL', 'COL6A3', 'ABO', 'RPL17', 'SLC2A1', 'C4orf3', 'NOP10', 'KLC2', 'MT-CO2', 'NCALD', 'EGLN3', 'OPTN', 'ANKRD9', 'TRAK1', 'CNNM2', 'RRAS', 'BNIP3', 'ENTR1', 'FGF8', 'B4GALT1', 'GPI', 'LIMCH1', 'MIR663AHG', 'SREK1IP1P1', 'FBXL17', 'LINC02541', 'P4HA1', 'H2BC4', 'RPL41', 'COX8A', 'MT-TS2', 'PROSER1', 'PGK1', 'CAMK2N1', 'HEPACAM', 'MSR1', 'GDI1', 'SIGMAR1', 'AHNAK2', 'GCAT', 'SINHCAFP3', 'CTXN1', 'LINC01133', 'POLR3GL', 'HES4', 'PDCD4', 'TNFRSF12A', 'ENKD1', 'SHOX', 'RGPD4-AS1', 'HIF3A', 'S100A10', 'APOOL', 'RTL8C', 'ARNTL', 'HMGB2', 'NEDD9', 'TMEM70', 'FASTKD5', 'DAAM1', 'HSP90AB1', 'ZBED2', 'EFNA5', 'PSMG1', 'TMSB4XP4', 'NPM1P40', 'RPL39', 'AJAP1', 'SAMD4A', 'WDR77', 'PAQR7', 'NDUFB4', 'BTN3A2', 'VIT', 'ARHGDIA', 'H3C2', 'FOSL2', 'MIXL1', 'MCM3AP', 'GJB3', 'PRRC2A', 'FSD1L', 'IVL', 'KCNJ3', 'BNIP3L', 'S100A11', 'BMPR1B', 'H2BC9', 'TNNT1', 'CEP120', 'LINC02367', 'RAB30', 'ZBED4', 'RAB11FIP4', 'RNF122', 'NEDD4L', 'RAB2B', 'RPS27', 'CSTB', 'C1orf53', 'NCK1', 'CPEB1', 'MLLT3', 'MELTF-AS1', 'TCF7L1', 'MT1E', 'RPSAP48', 'TNFSF13B', 'ECH1', 'NDUFA8', 'MIOS-DT', 'KRT19', 'ZNF318', 'POLDIP2', 'VPS45', 'ZNF418', 'YTHDF3', 'MT-ND4L', 'PI4KB', 'ADARB1', 'AXL', 'CACNA1A', 'TUBB6', 'NRG4', 'NMD3', 'FAM126B', 'PHACTR1', 'TXNRD2', 'BAP1', 'HSPD1', 'PLD1', 'JAKMIP3', 'DDX23', 'RPL28', 'ANKEF1', 'RPS6KA6', 'DUSP5', 'SH3RF1', 'ARHGEF26', 'SLC6A8', 'JUN', 'OVOL1', 'APEH', 'CAVIN3', 'ZNF302', 'DCAKD', 'ARL2', 'LINC01902', 'RBSN', 'CREB1', 'TATDN2', 'PRRG3', 'RPS21', 'ALDOC', 'MMP2', 'POLE4', 'PTGR1', 'CCDC168', 'GBP1P1', 'TSHZ2', 'IRF2BPL', 'ADM', 'ZBTB20', 'CAST', 'RPS29', 'AKR1C1', 'PCDHGA10', 'RGS10', 'TGDS', 'EPHX1', 'KAT7', 'NEUROD2', 'CFAP251', 'MXRA5', 'PFKFB3', 'PLOD2', 'PPTC7', 'ING2', 'CD47', 'ZNF33B', 'KIRREL1', 'KDM3A', 'UQCC2', 'FUT11', 'MXI1', 'MED18', 'SYNJ2', 'SNHG18', 'RNF25', 'AKT1S1', 'KLLN', 'NCAM1', 'RAB12', 'PDLIM1', 'MT1X', 'DERA', 'YTHDF1', 'AMFR', 'CEP83', 'SF3B4', 'PHRF1', 'GYS1', 'SRA1', 'EPPK1', 'SYT14', 'FAM162A', 'KCNJ2', 'ARMC6', 'MKNK1', 'HSP90AA1', 'INHBA', 'SRSF8', 'FYN', 'BTBD7P1', 'CENPB', 'RHBDD2', 'SNX22', 'SLC2A6', 'LINC01116', 'ISOC2', 'MPHOSPH6', 'JUND', 'RAB3GAP1', 'MNS1', 'DTYMK', 'TOLLIP', 'GIN1', 'FAH', 'GOLGA4', 'TMEM256', 'DGKD', 'WDR43', 'CAMSAP2', 'NACA4P', 'ARHGAP42', 'NDUFC1', 'GAPDH', 'TMEM238', 'GRK2', 'DNAH11', 'ZBTB2', 'TRIM44', 'CIAO2A', 'UTP3', 'CALM2', 'BRMS1', 'PCDHB1', 'TTL', 'FOSL1', 'YKT6', 'ACSL4', 'CCDC34', 'SAT2', 'RHOT2', 'MAD2L1', 'DBT', 'RPL27A', 'RPL37A', 'NUP93', 'AMOTL2', 'PPP4R2', 'CARM1', 'VEGFB', 'NCLN', 'MLLT6', 'MAP2K3', 'DNAAF5', 'PUSL1', 'PSMA7', 'DDX54', 'TCEAL9', 'RPLP0P2', 'KRT4', 'SNORD3B-1', 'FEM1A', 'TRIM52-AS1', 'MCM4', 'CCNG2', 'YWHAZ', 'ARID5B', 'MRPL55', 'KMT2D', 'SPG21', 'ZC3H15', 'EMP2', 'LETM1', 'EIF3J', 'SNRNP70', 'RHOD', 'MAFF', 'MAZ', 'UQCR11', 'PLCE1', 'CPTP', 'ARHGEF7', 'STMN1', 'ZNF202', 'SNHG9', 'HMGA1', 'CLIC1', 'ZHX1', 'TPD52L1', 'FRY', 'DNMT3A', 'ARL13B', 'SMARCB1', 'TWNK', 'RRS1', 'HEY1', 'MRPS2', 'SLC25A48', 'TMEM80', 'DYSF', 'MTA2', 'C19orf53', 'ARSA', 'DGKZ', 'VRK3', 'UIMC1', 'PSIP1', 'ZNF688', 'CMIP', 'PPIG', 'EXOC7', 'TAF15', 'MARCKS', 'AK4', 'KIF5B', 'ATP5F1E', 'IRAK1', 'BRAT1', 'TSR1', 'SART1', 'CAP1', 'SETD2', 'METTL26', 'STC2', 'DDIT3', 'KEAP1', 'DLD', 'CLIP2']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2117             108
Actual Norm              92            1354
Accuracy: 0.9455189321710705
Classification report:
               precision    recall  f1-score   support

        Hypo       0.96      0.95      0.95      2225
        Norm       0.93      0.94      0.93      1446

    accuracy                           0.95      3671
   macro avg       0.94      0.94      0.94      3671
weighted avg       0.95      0.95      0.95      3671

Select the top principal components from the models trained on PCA encoded data.

In [ ]:
print("SmartSeq MCF7 SVM")
ss_mcf7_pca_svm_pcs = get_selected_pcs_from_model(ss_mcf7_pca_svm.model)
print()

print("SmartSeq HCC SVM")
ss_hcc_pca_svm_pcs = get_selected_pcs_from_model(ss_hcc_pca_svm.model)
print()

print("DropSeq MCF7 SVM")
ds_mcf7_pca_svm_pcs = get_selected_pcs_from_model(ds_mcf7_pca_svm.model)
print()

print("DropSeq HCC SVM")
ds_hcc_pca_svm_pcs = get_selected_pcs_from_model(ds_hcc_pca_svm.model)
print()
SmartSeq MCF7 SVM
Top 8 principal components:
[1, 3, 6, 8, 12, 16, 17, 18]

SmartSeq HCC SVM
Top 11 principal components:
[2, 3, 9, 10, 12, 15, 17, 21, 26, 30, 32]

DropSeq MCF7 SVM
Top 300 principal components:
[1, 2, 3, 5, 6, 8, 15, 16, 17, 18, 19, 25, 26, 27, 28, 29, 30, 31, 32, 33, 36, 37, 40, 43, 44, 45, 46, 52, 55, 56, 57, 60, 61, 62, 65, 66, 69, 71, 74, 81, 82, 85, 87, 88, 91, 92, 94, 95, 99, 100, 104, 105, 107, 110, 112, 114, 115, 116, 118, 119, 120, 121, 127, 128, 135, 138, 140, 141, 142, 145, 146, 147, 149, 153, 157, 160, 161, 167, 170, 172, 173, 175, 177, 181, 188, 190, 191, 193, 195, 198, 200, 201, 203, 204, 205, 206, 211, 212, 213, 218, 219, 230, 231, 232, 234, 235, 239, 243, 245, 247, 249, 252, 254, 255, 257, 260, 263, 264, 267, 271, 273, 275, 279, 281, 282, 286, 287, 291, 293, 302, 305, 312, 317, 318, 319, 320, 322, 323, 327, 329, 332, 339, 341, 342, 344, 348, 350, 352, 353, 355, 361, 362, 364, 370, 371, 375, 380, 383, 385, 387, 389, 391, 392, 393, 398, 399, 400, 401, 402, 406, 408, 409, 411, 415, 418, 419, 429, 431, 433, 434, 435, 436, 438, 442, 449, 455, 457, 459, 460, 461, 462, 464, 466, 469, 470, 471, 481, 483, 484, 485, 486, 487, 491, 494, 495, 496, 497, 499, 504, 507, 508, 510, 512, 515, 517, 518, 519, 520, 522, 527, 534, 538, 540, 541, 543, 546, 552, 555, 556, 557, 564, 565, 571, 576, 579, 580, 582, 585, 591, 594, 596, 597, 598, 599, 602, 603, 606, 610, 612, 615, 621, 623, 626, 627, 631, 632, 633, 642, 644, 646, 647, 650, 652, 653, 655, 658, 661, 667, 672, 674, 675, 677, 681, 682, 687, 696, 698, 700, 702, 705, 711, 718, 722, 724, 726, 729, 732, 733, 734, 736, 741, 742, 743, 745, 746, 751, 754, 755, 756, 758]

DropSeq HCC SVM
Top 343 principal components:
[2, 3, 4, 5, 6, 8, 11, 12, 15, 16, 18, 19, 20, 21, 23, 24, 26, 27, 29, 30, 31, 32, 34, 36, 37, 38, 39, 41, 45, 46, 47, 48, 49, 53, 54, 55, 63, 65, 69, 72, 76, 77, 88, 89, 90, 92, 94, 96, 99, 102, 103, 106, 115, 117, 118, 120, 123, 124, 127, 131, 135, 136, 137, 139, 140, 141, 142, 143, 145, 147, 148, 153, 154, 155, 157, 161, 162, 166, 167, 169, 170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 187, 189, 190, 191, 193, 197, 198, 200, 201, 205, 208, 210, 213, 215, 217, 218, 219, 220, 221, 224, 225, 227, 229, 230, 231, 234, 235, 237, 238, 239, 240, 243, 247, 249, 253, 254, 255, 259, 260, 261, 262, 263, 265, 266, 269, 270, 272, 275, 282, 290, 292, 294, 295, 297, 300, 301, 303, 305, 308, 310, 312, 313, 315, 318, 319, 323, 325, 326, 329, 332, 334, 336, 338, 339, 341, 345, 350, 351, 356, 357, 358, 359, 360, 362, 369, 371, 372, 373, 374, 375, 377, 379, 380, 382, 383, 384, 391, 393, 396, 399, 401, 402, 408, 409, 412, 413, 414, 415, 420, 427, 429, 434, 438, 439, 443, 444, 445, 447, 448, 450, 451, 452, 453, 457, 461, 474, 479, 488, 490, 494, 497, 503, 507, 509, 510, 515, 516, 521, 523, 525, 528, 534, 539, 540, 541, 543, 550, 551, 553, 561, 562, 563, 564, 565, 566, 567, 575, 576, 577, 581, 582, 583, 584, 586, 593, 594, 597, 598, 599, 600, 601, 603, 608, 609, 610, 612, 613, 622, 630, 632, 637, 640, 641, 642, 644, 646, 651, 656, 657, 667, 672, 674, 685, 689, 691, 692, 693, 698, 699, 700, 701, 705, 713, 714, 716, 717, 718, 730, 733, 746, 747, 749, 755, 756, 758, 760, 761, 763, 766, 769, 771, 777, 778, 781, 783, 784, 785, 789, 792, 793, 794, 799, 807, 808, 809, 810, 812, 813, 814, 815, 820, 828, 829, 831, 832, 835, 839, 842, 843]

Random forest¶

Use the feature selection pipeline to select genes from the raw data.

In [ ]:
ss_mcf7_random_forest, ss_mcf7_random_forest_features, ss_mcf7_random_forest_accuracy = feature_selection_random_forest(
    X = X_ss_mcf7,
    y = y_ss_mcf7,
    estimator = RandomForestClassifier(),
    estimator_params = {
        "n_estimators": [100, 200],
        "max_depth": [5, 10, 20],
        "min_samples_split": [5, 10],
        "min_samples_leaf": [2, 4, 8],
        "max_features": ["sqrt", 0.5],
        "bootstrap": [True],
        "class_weight": ["balanced"]
    },
    n_jobs = -1
)
Training data dimensions: (187, 3000)
Testing data dimensions: (63, 3000)
========================= Training =========================
Best Parameters: {'estimator__bootstrap': True, 'estimator__class_weight': 'balanced', 'estimator__max_depth': 5, 'estimator__max_features': 'sqrt', 'estimator__min_samples_leaf': 2, 'estimator__min_samples_split': 5, 'estimator__n_estimators': 100}
Best Score (CV avg): 0.9945945945945945
Training accuracy: 1.0
No description has been provided for this image
Number of selected genes: 54
Selected genes: ['CYP1B1', 'CYP1B1-AS1', 'NDRG1', 'PFKFB3', 'HK2', 'ADM', 'VEGFA', 'BNIP3', 'PFKFB4', 'ENO2', 'MT-CYB', 'SLC9A3R1', 'UBC', 'MT-CO3', 'GPI', 'EMP2', 'MT-CO2', 'DSCAM-AS1', 'PGK1', 'MT-CO1', 'DYNC2I2', 'SLC3A2', 'IFITM3', 'ERO1A', 'DSP', 'IRF2BP2', 'TUBG1', 'MT-ATP6', 'FUT11', 'P4HA1', 'FAM162A', 'PDK1', 'BNIP3L', 'MOV10', 'IFITM2', 'PYCR3', 'FDFT1', 'PFKP', 'ACLY', 'GAPDH', 'FDPS', 'FASN', 'TST', 'APEH', 'PSME2', 'SNRNP25', 'NECTIN2', 'TUBD1', 'MTATP6P1', 'EBP', 'ALDOA', 'CYB561A3', 'ACAT2', 'SQLE']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              31               0
Actual Norm               0              32
Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      1.00      1.00        31
        Norm       1.00      1.00      1.00        32

    accuracy                           1.00        63
   macro avg       1.00      1.00      1.00        63
weighted avg       1.00      1.00      1.00        63

In [ ]:
ss_hcc_random_forest, ss_hcc_random_forest_features, ss_hcc_random_forest_accuracy = feature_selection_random_forest(
    X = X_ss_hcc,
    y = y_ss_hcc,
    estimator = RandomForestClassifier(),
    estimator_params = {
        "n_estimators": [100, 200],
        "max_depth": [5, 10, 20],
        "min_samples_split": [5, 10],
        "min_samples_leaf": [2, 4, 8],
        "max_features": ["sqrt", 0.5],
        "bootstrap": [True],
        "class_weight": ["balanced"]
    },
    n_jobs = -1
)
Training data dimensions: (136, 3000)
Testing data dimensions: (46, 3000)
========================= Training =========================
Best Parameters: {'estimator__bootstrap': True, 'estimator__class_weight': 'balanced', 'estimator__max_depth': 5, 'estimator__max_features': 'sqrt', 'estimator__min_samples_leaf': 2, 'estimator__min_samples_split': 5, 'estimator__n_estimators': 100}
Best Score (CV avg): 0.9924263674614305
Training accuracy: 0.9926147162639153
No description has been provided for this image
Number of selected genes: 64
Selected genes: ['DDIT4', 'ANGPTL4', 'NDRG1', 'EGLN3', 'CA9', 'PLIN2', 'UPK1B', 'FAM83A', 'BNIP3', 'INSIG1', 'KRT19', 'BHLHE40', 'ALDOC', 'GPRC5A', 'KCTD11', 'SLC2A1', 'PGK1', 'SLC6A8', 'LOXL2', 'CDKN1A', 'PFKFB3', 'LDHA', 'ARRDC3', 'ADM', 'BUB1B', 'HILPDA', 'LBH', 'BUB1', 'FOSL2', 'KYNU', 'ASB2', 'ERO1A', 'EIF5', 'P4HA1', 'C1orf116', 'RALGDS', 'SNX33', 'MOB3A', 'GPI', 'CALB1', 'ALDOA', 'BLCAP', 'PLOD2', 'ZNF473', 'HES1', 'GYS1', 'ENO2', 'TMEM45A', 'BNIP3L', 'PLAC8', 'TPBG', 'C4orf3', 'EGLN1', 'PRSS8', 'FAM13A', 'SRM', 'HSPH1', 'MIF', 'LDHB', 'PPP1R3G', 'FUT11', 'FAM162A', 'KDM3A', 'P4HA2']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              24               1
Actual Norm               0              21
Accuracy: 0.9782608695652174
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      0.96      0.98        25
        Norm       0.95      1.00      0.98        21

    accuracy                           0.98        46
   macro avg       0.98      0.98      0.98        46
weighted avg       0.98      0.98      0.98        46

In [ ]:
ds_mcf7_random_forest, ds_mcf7_random_forest_features, ds_mcf7_random_forest_accuracy = feature_selection_random_forest(
    X = X_ds_mcf7,
    y = y_ds_mcf7,
    estimator = RandomForestClassifier(),
    estimator_params = {
        "n_estimators": [200, 500, 1000],
        "max_depth": [20, 50, None],
        "min_samples_split": [2, 5],
        "min_samples_leaf": [1, 2],
        "max_features": ["sqrt", 0.5, 0.8],
        "bootstrap": [True, False],
        "class_weight": ["balanced", None]
    },
    n_jobs = -1
)
Training data dimensions: (16219, 3000)
Testing data dimensions: (5407, 3000)
========================= Training =========================
Best Parameters: {'estimator__n_estimators': 1000, 'estimator__min_samples_split': 5, 'estimator__min_samples_leaf': 2, 'estimator__max_features': 'sqrt', 'estimator__max_depth': 50, 'estimator__class_weight': 'balanced', 'estimator__bootstrap': True}
Best Score (CV avg): 0.96612717826348
Training accuracy: 0.9963072692759114
No description has been provided for this image
Number of selected genes: 72
Selected genes: ['MALAT1', 'MT-RNR2', 'TFF1', 'MT-RNR1', 'H4C3', 'MT-CO3', 'MT-ND4', 'MT-ND3', 'MT-CYB', 'TMSB10', 'MT-ATP6', 'MT-CO2', 'BCYRN1', 'RPS5', 'HES1', 'LGALS1', 'TMEM64', 'DSCAM-AS1', 'RPL12', 'RPS12', 'TOB1', 'RPL39', 'RPS16', 'TFF3', 'FGF23', 'RPL35', 'SOX4', 'RPS19', 'RPLP2', 'RPL36', 'PARD6B', 'RPL13', 'TPI1', 'S100A10', 'RPS28', 'FTL', 'RPL35A', 'ENO1', 'KRT19', 'RPS14', 'RPL34', 'DSP', 'UQCRQ', 'RPS15A', 'ROMO1', 'ELOB', 'KRT8', 'RPS15', 'ATP5ME', 'S100A11', 'ATP5MK', 'NDUFB2', 'RPL15', 'SNRPD2', 'RPS27', 'SET', 'UBA52', 'RPL37A', 'KRT18', 'GAPDH', 'TMEM258', 'PGK1', 'PKM', 'RPL30', 'ACTB', 'RPL11', 'HSPB1', 'RPLP1', 'SERF2', 'COX7A2', 'COX7C', 'RPL23']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2111             119
Actual Norm              54            3123
Accuracy: 0.9680044386905863
Classification report:
               precision    recall  f1-score   support

        Hypo       0.98      0.95      0.96      2230
        Norm       0.96      0.98      0.97      3177

    accuracy                           0.97      5407
   macro avg       0.97      0.96      0.97      5407
weighted avg       0.97      0.97      0.97      5407

In [ ]:
ds_hcc_random_forest, ds_hcc_random_forest_features, ds_hcc_random_forest_accuracy = feature_selection_random_forest(
    X = X_ds_hcc,
    y = y_ds_hcc,
    estimator = RandomForestClassifier(),
    estimator_params = {
        "n_estimators": [200, 500, 1000],
        "max_depth": [20, 50, None],
        "min_samples_split": [2, 5],
        "min_samples_leaf": [1, 2],
        "max_features": ["sqrt", 0.5, 0.8],
        "bootstrap": [True, False],
        "class_weight": ["balanced", None]
    },
    n_jobs = -1
)
Training data dimensions: (11011, 3000)
Testing data dimensions: (3671, 3000)
========================= Training =========================
Best Parameters: {'estimator__n_estimators': 1000, 'estimator__min_samples_split': 5, 'estimator__min_samples_leaf': 2, 'estimator__max_features': 'sqrt', 'estimator__max_depth': 50, 'estimator__class_weight': 'balanced', 'estimator__bootstrap': True}
Best Score (CV avg): 0.9260927654063373
Training accuracy: 0.9988589561630253
No description has been provided for this image
Number of selected genes: 112
Selected genes: ['MALAT1', 'MT-RNR2', 'BCYRN1', 'IGFBP3', 'H1-3', 'H4C3', 'HSPA5', 'PLEC', 'HSP90B1', 'NDRG1', 'MT-TQ', 'BTBD9', 'ENO1', 'GPM6A', 'HNRNPA2B1', 'NEAT1', 'H2AC12', 'H1-1', 'FGF23', 'AKR1C2', 'TMSB10', 'RPS28', 'LDHA', 'RPS5', 'PDIA3', 'NCL', 'NCALD', 'EGLN3', 'CNNM2', 'BNIP3', 'B4GALT1', 'EZR', 'P4HA1', 'RPL41', 'PGK1', 'AHNAK2', 'RPS19', 'RPL35', 'S100A10', 'SERF2', 'CENPF', 'HSP90AB1', 'RPL12', 'TMSB4X', 'POLR2L', 'NPM1P40', 'RPL39', 'PKM', 'KCNJ3', 'BNIP3L', 'S100A11', 'HNRNPU', 'RPS27', 'RPLP2', 'RPLP1', 'KRT19', 'RPS2', 'TPI1', 'TPT1', 'CACNA1A', 'BAP1', 'HSPD1', 'RPL28', 'ZNF302', 'GSTP1', 'EEF2', 'PRRG3', 'MT2A', 'CAST', 'S100A6', 'RPL36', 'AKR1C1', 'DSP', 'ATAD2', 'RPSA', 'ELOB', 'RPS8', 'CBX3', 'RPL21', 'HSP90AA1', 'WDR43', 'GAPDH', 'RPS3', 'TRIM44', 'TPX2', 'CALM2', 'FOSL1', 'PTMS', 'RPL27A', 'RPL37A', 'RPL8', 'UQCRQ', 'PSMA7', 'HNRNPM', 'EEF1A1', 'YWHAZ', 'RPL37', 'RPL10', 'ZC3H15', 'EIF3J', 'STMN1', 'PABPC1', 'HMGA1', 'ATP5MG', 'DNMT1', 'RPL13', 'CAV1', 'C19orf53', 'MARCKS', 'ATP5F1E', 'RPS14', 'RAC1']
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2127              98
Actual Norm             170            1276
Accuracy: 0.9269953691092345
Classification report:
               precision    recall  f1-score   support

        Hypo       0.93      0.96      0.94      2225
        Norm       0.93      0.88      0.90      1446

    accuracy                           0.93      3671
   macro avg       0.93      0.92      0.92      3671
weighted avg       0.93      0.93      0.93      3671

Select the top principal components from the models trained on PCA encoded data.

In [ ]:
print("SmartSeq MCF7 Random Forest")
ss_mcf7_pca_random_forest_pcs = get_selected_pcs_from_model(ss_mcf7_pca_random_forest.model)
print()

print("SmartSeq HCC Random Forest")
ss_hcc_pca_random_forest_pcs = get_selected_pcs_from_model(ss_hcc_pca_random_forest.model)
print()

print("DropSeq MCF7 Random Forest")
ds_mcf7_pca_random_forest_pcs = get_selected_pcs_from_model(ds_mcf7_pca_random_forest.model)
print()

print("DropSeq HCC Random Forest")
ds_hcc_pca_random_forest_pcs = get_selected_pcs_from_model(ds_hcc_pca_random_forest.model)
print()
SmartSeq MCF7 Random Forest
Top 7 principal components:
[1, 2, 3, 4, 5, 6, 9]

SmartSeq HCC Random Forest
Top 3 principal components:
[2, 3, 4]

DropSeq MCF7 Random Forest
Top 82 principal components:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 18, 19, 21, 23, 25, 26, 27, 28, 32, 33, 35, 36, 37, 38, 47, 48, 55, 60, 82, 95, 97, 110, 116, 121, 133, 140, 142, 144, 145, 147, 149, 151, 157, 167, 170, 232, 236, 278, 297, 300, 301, 303, 307, 314, 317, 318, 320, 321, 322, 326, 328, 331, 335, 338, 344, 358, 367, 377, 379, 380, 381, 383, 403, 422, 424, 442, 458, 474, 475]

DropSeq HCC Random Forest
Top 104 principal components:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 23, 24, 26, 30, 31, 34, 35, 36, 37, 38, 39, 41, 42, 44, 45, 46, 47, 48, 49, 53, 54, 56, 60, 63, 65, 68, 69, 70, 72, 73, 74, 75, 76, 78, 80, 90, 99, 100, 102, 103, 106, 109, 110, 126, 127, 133, 134, 141, 142, 145, 147, 155, 156, 161, 167, 169, 170, 172, 174, 176, 182, 184, 185, 187, 189, 190, 192, 195, 196, 197, 198, 201, 202, 208, 212, 240, 250, 259, 262, 270, 283, 290, 301, 317]

Multilayer perceptron¶

Since multilayer perceptrons model complex interactions through their weights, they do not expose a "feature importances" property. This eliminates the possibility of recursive feature elimination. There is thus not much of a meaningful way to select features beyond using ANOVA or a model pre-selector (LinearSVC or RandomForestClassifier) which has already been done for the other models. Moreover, there is also no way to intrinsically select the top principal components from the pre-trained models.

However, due to diverse selection of features already done on the other models, there is already a robust, filtered feature set that isn't determined by any single model.

Top genes¶

The intersection of the selected genes for each data set, i.e. genes with X # of occurrences, can be used to develop a more robust, generalized model that can efficiently classify data. Feature selection on the models trained on the not PCA-encoded data selects genes with stronger predictive relationships with the label. Given a certain model, a certain gene may not be selected for all of them since there are different datasets. To understand the big picture, the selected genes can be collected and their occurrences can be counted to show in how many models the gene is selected.

In [ ]:
logit_feature_lists = [ss_mcf7_logit_features, ss_hcc_logit_features, ds_mcf7_logit_features, ds_hcc_logit_features]
top_logit_features_occurrences = count_and_sort_occurrences(logit_feature_lists, False)
top_logit_features = {}

for i in range(4, 0, -1):
    top_logit_features[i] = top_logit_features.get(i + 1, []) + filter_by_occurrences(top_logit_features_occurrences, i)

for i in range(4, 0, -1):
    if len(top_logit_features[i]) == 0:
        continue
    print(f"{len(top_logit_features[i])} gene(s) selected for logistic regression across {i}+ data set(s):")
    print(top_logit_features[i])
    print()
3 gene(s) selected for logistic regression across 4+ data set(s):
['MT-CO3', 'PGK1', 'MT-CYB']

10 gene(s) selected for logistic regression across 3+ data set(s):
['MT-CO3', 'PGK1', 'MT-CYB', 'DDIT4', 'LDHA', 'MT-CO2', 'MT-ATP6', 'TMSB10', 'HMGA1', 'KRT19']

95 gene(s) selected for logistic regression across 2+ data set(s):
['MT-CO3', 'PGK1', 'MT-CYB', 'DDIT4', 'LDHA', 'MT-CO2', 'MT-ATP6', 'TMSB10', 'HMGA1', 'KRT19', 'TRIM44', 'MT-TA', 'GPM6A', 'FUT11', 'MT-ND6', 'NDRG1', 'MT-RNR1', 'BAP1', 'PLOD2', 'IGFBP3', 'MT-ND4L', 'TUBA1B', 'HSPH1', 'CSTB', 'AURKA', 'ATXN2L', 'MT-ND1', 'GPI', 'EGLN3', 'BMPR1B', 'BNIP3', 'CAMK2N1', 'NCALD', 'CACNA1A', 'CAV1', 'DHCR7', 'PFKFB3', 'S100A11', 'C4orf3', 'S100A10', 'GOLGA4', 'GATA3', 'GAPDH', 'BTN3A2', 'BTBD9', 'KCNJ2', 'KCNJ3', 'RPSAP48', 'MT-TS1', 'MT-TQ', 'TPD52L1', 'BNIP3L', 'RPS6KA6', 'NEDD4L', 'ALDOC', 'SLC6A8', 'AMOTL2', 'PROSER1', 'FOSL2', 'ALDOA', 'NPM1P40', 'AKT1S1', 'ZC3H15', 'H2AC12', 'AKR1C2', 'AKR1C1', 'EMP2', 'HES1', 'ZNF302', 'CKS2', 'PRRG3', 'ADM', 'CLDN4', 'FAM162A', 'ZNF688', 'P4HA1', 'MAFF', 'H2AC11', 'CAMSAP2', 'YTHDF3', 'RHOD', 'ARMC6', 'KMT2D', 'SLC25A48', 'KPNA2', 'HSP90AA1', 'LGALS1', 'KRT4', 'SLC2A1', 'FEM1A', 'H19', 'FGF23', 'WDR43', 'APOOL', 'RGPD4-AS1']

858 gene(s) selected for logistic regression across 1+ data set(s):
['MT-CO3', 'PGK1', 'MT-CYB', 'DDIT4', 'LDHA', 'MT-CO2', 'MT-ATP6', 'TMSB10', 'HMGA1', 'KRT19', 'TRIM44', 'MT-TA', 'GPM6A', 'FUT11', 'MT-ND6', 'NDRG1', 'MT-RNR1', 'BAP1', 'PLOD2', 'IGFBP3', 'MT-ND4L', 'TUBA1B', 'HSPH1', 'CSTB', 'AURKA', 'ATXN2L', 'MT-ND1', 'GPI', 'EGLN3', 'BMPR1B', 'BNIP3', 'CAMK2N1', 'NCALD', 'CACNA1A', 'CAV1', 'DHCR7', 'PFKFB3', 'S100A11', 'C4orf3', 'S100A10', 'GOLGA4', 'GATA3', 'GAPDH', 'BTN3A2', 'BTBD9', 'KCNJ2', 'KCNJ3', 'RPSAP48', 'MT-TS1', 'MT-TQ', 'TPD52L1', 'BNIP3L', 'RPS6KA6', 'NEDD4L', 'ALDOC', 'SLC6A8', 'AMOTL2', 'PROSER1', 'FOSL2', 'ALDOA', 'NPM1P40', 'AKT1S1', 'ZC3H15', 'H2AC12', 'AKR1C2', 'AKR1C1', 'EMP2', 'HES1', 'ZNF302', 'CKS2', 'PRRG3', 'ADM', 'CLDN4', 'FAM162A', 'ZNF688', 'P4HA1', 'MAFF', 'H2AC11', 'CAMSAP2', 'YTHDF3', 'RHOD', 'ARMC6', 'KMT2D', 'SLC25A48', 'KPNA2', 'HSP90AA1', 'LGALS1', 'KRT4', 'SLC2A1', 'FEM1A', 'H19', 'FGF23', 'WDR43', 'APOOL', 'RGPD4-AS1', 'GPATCH4', 'GRK2', 'GJB3', 'H2BC4', 'GLE1', 'H2AX', 'GNAQ', 'H2AC20', 'H1-0', 'GPX2', 'GPRC5A', 'GOLGA3', 'H2AC16', 'GSE1', 'GYS1', 'GREM1', 'GAB2', 'GIN1', 'FBRS', 'FLOT2', 'FLNA', 'FGFBP1', 'FGF8', 'FDFT1', 'FBXL18', 'FBXL17', 'FBXL16', 'FASTKD5', 'GFRA1', 'FARP1', 'FAM83A', 'FAM50A', 'FAM189B', 'FAM177A1', 'FAM13B', 'FAM126B', 'FAM111B', 'FN1', 'FOS', 'FOSL1', 'FRS2', 'GDPGP1', 'GDI1', 'GDF15', 'GDAP2', 'GCAT', 'GBP1P1', 'GATAD2A', 'GABRE', 'GABPB2', 'H3C2', 'FYN', 'FYB1', 'FTL', 'FTH1', 'FSD1L', 'FSCN1', 'FRY', 'H2BC9', 'IFI27L2', 'H4C3', 'KITLG', 'KRT80', 'KRT8', 'KRT18', 'KPNA4', 'KNSTRN', 'KLLN', 'KLHL8', 'KLC2', 'KLC1', 'KIRREL1', 'H4C5', 'KIF5B', 'KIF2C', 'KIF23', 'KIF14', 'KHSRP', 'KEAP1', 'KDM5B', 'KDM3A', 'KCTD11', 'KYNU', 'LAD1', 'LAMB3', 'LCLAT1', 'LXN', 'LTBR', 'LRRFIP2', 'LPP', 'LOXL2', 'LMNB2', 'LMNA', 'LINC02541', 'LINC02511', 'LINC02367', 'LINC01902', 'LINC01304', 'LINC01291', 'LINC01133', 'LINC01116', 'LIMCH1', 'LETM1', 'LDLRAP1', 'LDHB', 'KCNQ1OT1', 'KAT7', 'JUP', 'FAM102A', 'ID3', 'HSPD1', 'HSPA8', 'HSPA5', 'HSP90B1', 'HSP90AB1', 'HRH1', 'HOXC13', 'HMGCS1', 'HMGB2', 'HLA-A', 'HIF3A', 'HEY1', 'HES4', 'HERPUD1', 'HEPACAM', 'HELQ', 'HCFC1', 'HBP1', 'IER2', 'IGFBP5', 'JUND', 'ILRUN', 'JUNB', 'JUN', 'JAKMIP3', 'IWS1', 'IVL', 'ITPK1', 'ITGA6', 'ISOC2', 'ISG15', 'ISCU', 'IRF6', 'IRF2BPL', 'IRAK1', 'INSIG1', 'INPP4B', 'INHBA', 'ING2', 'INF2', 'INCENP', 'FAM104A', 'ZWINT', 'FAH', 'BRIP1', 'C7orf50', 'C6orf62', 'C2orf49', 'C1orf53', 'C19orf53', 'C16orf91', 'C10orf55', 'BTBD7P1', 'BRPF3', 'BRMS1', 'BRAT1', 'CDC25B', 'BOLA3', 'BMS1', 'BLOC1S3', 'BLCAP', 'BIRC5', 'BICDL1', 'BHLHE40', 'BCYRN1', 'BCL3', 'BCAS3', 'C9orf78', 'CA9', 'CACHD1', 'CACNB2', 'CD47', 'CD44', 'CCNG2', 'CCNB2', 'CCNB1', 'CCM2', 'CCDC34', 'CCDC18', 'CCDC168', 'CBX3', 'CBFA2T3', 'CAVIN3', 'CAST', 'CASP8AP2', 'CARM1', 'CAPZA1', 'CAP1', 'CANX', 'CALM2', 'CALHM2', 'CACNG4', 'BBOF1', 'BAZ2A', 'BAG3', 'APEH', 'ANKRD9', 'ANKRD52', 'ANKRD40', 'ANKRD17', 'ANKEF1', 'ANGPTL4', 'AMFR', 'ALDH1A3', 'AKR1C3', 'AKAP5', 'AK4', 'AJAP1', 'AHNAK2', 'AFF1', 'ADARB1', 'ADAP1', 'ACTB', 'ACSL4', 'ACAT2', 'ABO', 'ABL1', 'ANXA6', 'ARF3', 'B4GALT1', 'ARFGEF1', 'AXL', 'ATXN1L', 'ATRX', 'ATP5F1E', 'ATN1', 'ATF5', 'ARTN', 'ARSA', 'ARPP19', 'ARNTL2', 'ARNTL', 'ARL2', 'ARL13B', 'ARIH1', 'ARID5B', 'ARID1B', 'ARHGEF7', 'ARHGEF26', 'ARHGDIA', 'ARHGAP42', 'ARHGAP26', 'CDC20', 'CDC6', 'F3', 'DNAJA3', 'DYSF', 'DVL3', 'DUSP9', 'DUSP5', 'DTYMK', 'DTNB', 'DSP', 'DNMT3A', 'DNAJC21', 'DNAJA4', 'DNAJA1', 'CDK2AP2', 'DNAH11', 'DNAAF5', 'DLD', 'DKK1', 'DKC1', 'DHX38', 'DHX37', 'DHRS3', 'DGKZ', 'DGKD', 'EBAG9', 'ECH1', 'EFNA2', 'EFNA5', 'EXOC7', 'ETF1', 'ESRP2', 'ERO1A', 'EPPK1', 'EPHX1', 'ENTR1', 'ENOX2', 'ENO1', 'ENKD1', 'ELP3', 'ELOA', 'EIF5', 'EIF4G2', 'EIF4A2', 'EIF3J', 'EIF3A', 'EIF2B4', 'EHBP1L1', 'LYAR', 'EGLN1', 'DERA', 'DDX54', 'DDX5', 'CNR2', 'CNOT6L', 'CNNM2', 'CMIP', 'CLTB', 'CLSPN', 'CLIP2', 'CLIC1', 'CLDN7', 'CKAP2', 'CITED2', 'CIAO2A', 'CHAC2', 'CHAC1', 'CFAP97', 'CFAP251', 'CERS2', 'CEP83', 'CEP120', 'CENPB', 'CEACAM5', 'CDKN1A', 'CNOT9', 'COL6A3', 'DDX23', 'COX8A', 'DDIT3', 'DCTN1', 'DCBLD2', 'DCAKD', 'DBT', 'DBNDD1', 'DANT1', 'DAAM1', 'CYP1B1', 'CXCL1', 'CTXN1', 'CSNK2A2', 'CSK', 'CS', 'CRTC2', 'CRNDE', 'CRIP2', 'CREB1', 'CPTP', 'CPNE2', 'CPEB1', 'LY6D', 'MIF-AS1', 'MAD2L1', 'SOCS2', 'SRA1', 'SQSTM1', 'SPRY1', 'SPP1', 'SPN', 'SPG21', 'SPATS2L', 'SPAG5', 'SOX4', 'SOS1', 'SNX27', 'SREK1IP1P1', 'SNX24', 'SNX22', 'SNRNP70', 'SNORD3B-1', 'SNHG9', 'SNHG18', 'SMKR1', 'SMIM27', 'SMC6', 'SMC5', 'SRCAP', 'SRFBP1', 'MAP2K3', 'TAF13', 'TCF20', 'TCEAL9', 'TBKBP1', 'TBCA', 'TBC1D9', 'TATDN2', 'TARS1', 'TAOK3', 'TAF9B', 'TAF15', 'SYT14', 'SRM', 'SYNJ2', 'SYNE2', 'SULF2', 'STRIP1', 'STRBP', 'STMN1', 'STC2', 'STARD10', 'SSX2IP', 'SRXN1', 'SMARCB1', 'SLCO4A1', 'SLC9A3R1', 'RPLP0P2', 'S100P', 'S100A2', 'RTL8C', 'RSRC2', 'RRS1', 'RRP1B', 'RRAS', 'RPS29', 'RPS27', 'RPS21', 'RPL41', 'SLC48A1', 'RPL39', 'RPL37A', 'RPL34', 'RPL30', 'RPL28', 'RPL27A', 'RPL23', 'RPL17', 'RPL15', 'RPL13', 'SAMD4A', 'SART1', 'SAT2', 'SCD', 'SLC39A6', 'SLC38A2', 'SLC2A6', 'SLC25A24', 'SLC20A1', 'SLC13A5', 'SLAIN2', 'SINHCAFP3', 'SIGMAR1', 'SHOX', 'SHISA5', 'SH3RF1', 'SF3B4', 'SETD3', 'SETD2', 'SET', 'SERINC5', 'SENP6', 'SEMA4B', 'SECISBP2L', 'SCYL2', 'TCF7L1', 'TCHP', 'TEDC2-AS1', 'VIT', 'YKT6', 'XBP1', 'WWC3', 'WTAPP1', 'WSB2', 'WDR77', 'VRK3', 'VPS9D1-AS1', 'VPS45', 'VMP1', 'VEGFB', 'UBB', 'VCPIP1', 'UTP3', 'UTP18', 'USP35', 'USP32', 'UQCR11', 'UQCC2', 'UPK1B', 'UIMC1', 'UGDH', 'YTHDF1', 'YWHAB', 'YWHAZ', 'ZBED2', 'ZNRF1', 'ZNF764', 'ZNF703', 'ZNF702P', 'ZNF480', 'ZNF418', 'ZNF354A', 'ZNF33B', 'ZNF326', 'ZNF318', 'ZNF263', 'ZNF202', 'ZMIZ1', 'ZHX1', 'ZFP36', 'ZFC3H1', 'ZBTB7A', 'ZBTB34', 'ZBTB20', 'ZBTB2', 'ZBED4', 'UBE2Q2', 'UBA52', 'TFF1', 'TMEM258', 'TOB1', 'TNNT1', 'TNIP2', 'TNFSF13B', 'TNFRSF12A', 'TMSB4XP4', 'TMEM80', 'TMEM70', 'TMEM64', 'TMEM259', 'TMEM256', 'TYSND1', 'TMEM238', 'TIMELESS', 'TIAM1', 'THRB', 'THBS1', 'THAP1', 'TGFB3', 'TGDS', 'TFRC', 'TFF3', 'TOLLIP', 'TPBG', 'TPI1', 'TPM1', 'TXNRD2', 'TXNIP', 'TXN', 'TWNK', 'TUBB6', 'TUBB4B', 'TUBB', 'TTL', 'TSR1', 'TSPYL1', 'TSPO', 'TSHZ2', 'TRIM52-AS1', 'TRIM37', 'TRIM29', 'TRIM16', 'TRAK2', 'TRAK1', 'TRAF3IP2', 'TPX2', 'TPM4', 'RNF25', 'RNF146', 'RNF122', 'MYO5C', 'NCKAP1', 'NCK1', 'NCDN', 'NCBP3', 'NCAM1', 'NBEAL2', 'NAXD', 'NACC1', 'NACA4P', 'NAA10', 'MYO10', 'MT-TV', 'MYH14', 'MYC', 'MXRA5', 'MXI1', 'MTND2P28', 'MTND1P23', 'MTA2', 'MT2A', 'MT1X', 'MT1E', 'NCL', 'NCLN', 'NCOA1', 'NCOA5', 'NSD1', 'NRP1', 'NRG4', 'NQO1', 'NPLOC4', 'NOP10', 'NOM1', 'NOLC1', 'NOL4L', 'NME1-NME2', 'NMD3', 'NLK', 'NINJ1', 'NFIC', 'NEUROD2', 'NEDD9', 'NEDD1', 'NEAT1', 'NDUFC1', 'NDUFB4', 'NDUFA8', 'MT-TY', 'MT-TS2', 'NUP188', 'MELTF-AS1', 'MKNK1', 'MIXL1', 'MIR663AHG', 'MIR210HG', 'MIOS-DT', 'ZRANB1', 'MIF', 'MGRN1', 'MGLL', 'METTL26', 'MED18', 'MT-TP', 'MDM2', 'MCM4', 'MCM3AP', 'MB', 'MAZ', 'MARK3', 'MARK2', 'MARCKS', 'MAPKAPK2', 'MAP3K13', 'MLLT3', 'MLLT6', 'MMP1', 'MMP2', 'MT-TN', 'MT-TM', 'MT-TL1', 'MT-TE', 'MT-TD', 'MT-RNR2', 'MT-ND5', 'MT-ND4', 'MT-ND3', 'MT-ND2', 'MT-CO1', 'MT-ATP8', 'MSR1', 'MSMO1', 'MSMB', 'MRPL55', 'MRNIP', 'MPHOSPH9', 'MPHOSPH6', 'MPDU1', 'MNS1', 'NT5C', 'NUP93', 'RHOT2', 'PRR12', 'PSMD2', 'PSMD14', 'PSMA7', 'PSIP1', 'PRXL2C', 'PRSS23', 'PRRC2C', 'PRRC2A', 'PRR5L', 'PRR34-AS1', 'PRNP', 'POLR3GL', 'PRMT6', 'PREX1', 'PRDX1', 'PRC1', 'PPTC7', 'PPP4R2', 'PPP1R12B', 'PPM1G', 'PPIL1', 'PPIG', 'PSMD5', 'PSME4', 'PSMG1', 'PTGR1', 'RHBDD2', 'RGS10', 'RFK', 'RCC1L', 'RBSN', 'RBBP6', 'RAPGEF3', 'RAI14', 'RAD23A', 'RABEP1', 'RAB5C', 'RAB3GAP1', 'RAB35', 'RAB30', 'RAB2B', 'RAB27A', 'RAB1B', 'RAB12', 'RAB11FIP4', 'PYGO2', 'PTP4A2', 'PPIF', 'POLR3A', 'NUPR2', 'PCDH1', 'PFDN4', 'PERP', 'PDS5A', 'PDLIM1', 'PDCD4', 'PDAP1', 'PCYT1A', 'PCNA', 'PCDHGA10', 'PCDHB1', 'PATL1', 'POLR2A', 'PARD6B', 'PAQR8', 'PAQR7', 'PAPOLA', 'PAK2', 'PACS1', 'P4HA2', 'OVOL1', 'OTUD7B', 'OPTN', 'PGAM1', 'PGAM5', 'PHACTR1', 'PHF20L1', 'POLE4', 'POLDIP2', 'POLB', 'PMEPA1', 'PLK2', 'PLIN2', 'PLEC', 'PLD1', 'PLCE1', 'PLCD3', 'PLCB4', 'PLBD2', 'PLAU', 'PKM', 'PKIB', 'PITX1', 'PITPNA', 'PICALM', 'PI4KB', 'PHRF1', 'PHLDA2', 'AAMP']

In [ ]:
svm_feature_lists = [ss_mcf7_svm_features, ss_hcc_svm_features, ds_mcf7_svm_features, ds_hcc_svm_features]
top_svm_features_occurrences = count_and_sort_occurrences(svm_feature_lists, False)
top_svm_features = {}

for i in range(4, 0, -1):
    top_svm_features[i] = top_svm_features.get(i + 1, []) + filter_by_occurrences(top_svm_features_occurrences, i)

for i in range(4, 0, -1):
    if len(top_svm_features[i]) == 0:
        continue
    print(f"{len(top_svm_features[i])} gene(s) selected for SVM across {i}+ data set(s):")
    print(top_svm_features[i])
    print()
2 gene(s) selected for SVM across 4+ data set(s):
['PGK1', 'MT-CYB']

7 gene(s) selected for SVM across 3+ data set(s):
['PGK1', 'MT-CYB', 'LDHA', 'TMSB10', 'DDIT4', 'MT-CO3', 'MT-CO2']

70 gene(s) selected for SVM across 2+ data set(s):
['PGK1', 'MT-CYB', 'LDHA', 'TMSB10', 'DDIT4', 'MT-CO3', 'MT-CO2', 'S100A10', 'RPSAP48', 'TPD52L1', 'RPS6KA6', 'BAP1', 'ATXN2L', 'NEDD4L', 'TRIM44', 'BMPR1B', 'MT-TS1', 'MT-TQ', 'KRT19', 'TWNK', 'FGF23', 'FEM1A', 'MT-TA', 'KMT2D', 'TMEM64', 'KCNJ3', 'KCNJ2', 'HMGA1', 'NPM1P40', 'H2AC12', 'H2AC11', 'H19', 'STC2', 'CAV1', 'GPM6A', 'SLC25A48', 'GOLGA4', 'CAMSAP2', 'CAMK2N1', 'CACNA1A', 'BTN3A2', 'BTBD9', 'GATA3', 'GAPDH', 'NCALD', 'S100A11', 'ARMC6', 'HEPACAM', 'DSP', 'MAFF', 'RHOD', 'ZBTB20', 'YTHDF3', 'ZNF302', 'FAM162A', 'ZC3H15', 'MT-ATP6', 'AKT1S1', 'ZNF688', 'MT-ND6', 'AKR1C2', 'PRRG3', 'RGPD4-AS1', 'APOOL', 'WDR43', 'EMP2', 'MT-ND1', 'PROSER1', 'LGALS1', 'MT-ND4L']

757 gene(s) selected for SVM across 1+ data set(s):
['PGK1', 'MT-CYB', 'LDHA', 'TMSB10', 'DDIT4', 'MT-CO3', 'MT-CO2', 'S100A10', 'RPSAP48', 'TPD52L1', 'RPS6KA6', 'BAP1', 'ATXN2L', 'NEDD4L', 'TRIM44', 'BMPR1B', 'MT-TS1', 'MT-TQ', 'KRT19', 'TWNK', 'FGF23', 'FEM1A', 'MT-TA', 'KMT2D', 'TMEM64', 'KCNJ3', 'KCNJ2', 'HMGA1', 'NPM1P40', 'H2AC12', 'H2AC11', 'H19', 'STC2', 'CAV1', 'GPM6A', 'SLC25A48', 'GOLGA4', 'CAMSAP2', 'CAMK2N1', 'CACNA1A', 'BTN3A2', 'BTBD9', 'GATA3', 'GAPDH', 'NCALD', 'S100A11', 'ARMC6', 'HEPACAM', 'DSP', 'MAFF', 'RHOD', 'ZBTB20', 'YTHDF3', 'ZNF302', 'FAM162A', 'ZC3H15', 'MT-ATP6', 'AKT1S1', 'ZNF688', 'MT-ND6', 'AKR1C2', 'PRRG3', 'RGPD4-AS1', 'APOOL', 'WDR43', 'EMP2', 'MT-ND1', 'PROSER1', 'LGALS1', 'MT-ND4L', 'GOLGA3', 'FAM111B', 'GNAQ', 'FAM104A', 'GPATCH4', 'GPI', 'GLE1', 'FAM126B', 'FBXL18', 'GREM1', 'FAM102A', 'FAH', 'H4C5', 'H3C2', 'H2BC9', 'H2BC4', 'H2AC20', 'H2AC16', 'EPHX1', 'EPPK1', 'ESRP2', 'GYS1', 'GSE1', 'ETF1', 'GRK2', 'GIN1', 'EXOC7', 'GJB3', 'FAM13B', 'GDPGP1', 'FRY', 'FBXL17', 'FGD5-AS1', 'FGF8', 'FBXL16', 'FLNA', 'FLOT2', 'FOS', 'FOSL1', 'HCFC1', 'FRS2', 'FBRS', 'FASTKD5', 'FARP1', 'FSD1L', 'GDI1', 'FUT11', 'FYN', 'FAM50A', 'FAM189B', 'FAM177A1', 'GAB2', 'GABPB2', 'GABRE', 'GATAD2A', 'GBP1P1', 'GCAT', 'GDAP2', 'GDF15', 'FOSL2', 'ZRANB1', 'HELQ', 'KRT80', 'LINC01291', 'LINC01133', 'LINC01116', 'LIMCH1', 'LETM1', 'LDLRAP1', 'LCLAT1', 'LAD1', 'KRT4', 'HES1', 'KPNA4', 'KPNA2', 'KLLN', 'KLHL8', 'KLC2', 'KLC1', 'KITLG', 'KIRREL1', 'LINC01304', 'LINC01902', 'LINC02367', 'LINC02511', 'MCM3AP', 'MB', 'MAZ', 'MARK3', 'MARK2', 'MARCKS', 'MAPKAPK2', 'MAP3K13', 'MAP2K3', 'MAD2L1', 'LYAR', 'LXN', 'LTBR', 'LRRFIP2', 'LPP', 'LMNB2', 'LINC02541', 'KIF5B', 'KIF14', 'KHSRP', 'IGFBP5', 'ENOX2', 'IFITM3', 'IFI27L2', 'HSPH1', 'HSPD1', 'HSPA8', 'HSPA5', 'HSP90AB1', 'HSP90AA1', 'HPCAL1', 'HOXC13', 'HNRNPA2B1', 'HMGB2', 'HILPDA', 'HIF3A', 'HEY1', 'HES4', 'IGFBP3', 'ILRUN', 'KEAP1', 'INCENP', 'KDM3A', 'KCNQ1OT1', 'KAT7', 'JUND', 'JUN', 'JAKMIP3', 'IWS1', 'IVL', 'ITPK1', 'ISOC2', 'ISCU', 'IRF2BPL', 'IRAK1', 'INPP4B', 'INHBA', 'ING2', 'INF2', 'ENTR1', 'DYNC2I2', 'ENO1', 'BRIP1', 'BOLA3', 'BNIP3L', 'BNIP3', 'BMS1', 'BLOC1S3', 'BICDL1', 'BEND7', 'BCYRN1', 'BCL3', 'BCAS3', 'BBOF1', 'BAZ2A', 'B4GALT1', 'AXL', 'AURKA', 'ATXN1L', 'ATRX', 'ATP9A', 'ATP5F1E', 'BRAT1', 'BRMS1', 'ENKD1', 'BRPF3', 'CAST', 'CASP8AP2', 'CARM1', 'CAPZA1', 'CAP1', 'CALM2', 'CALHM2', 'CACNG4', 'CACNB2', 'CACHD1', 'C9orf78', 'C7orf50', 'C6orf62', 'C4orf3', 'C2orf49', 'C1orf53', 'C19orf53', 'C16orf91', 'BTBD7P1', 'ATN1', 'ATF5', 'ARTN', 'ARSA', 'ANKEF1', 'ANGPTL4', 'AMOTL2', 'AMFR', 'ALDOC', 'ALDOA', 'AKR1C3', 'AKR1C1', 'AKAP5', 'AK4', 'AJAP1', 'AHNAK2', 'AFF1', 'ADM', 'ADARB1', 'ACTB', 'ACSL4', 'ABO', 'ABL1', 'ANKRD17', 'ANKRD40', 'ANKRD52', 'ARID1B', 'ARPP19', 'ARPC1B', 'ARNTL2', 'ARNTL', 'ARL2', 'ARL13B', 'ARIH1', 'ARID5B', 'ARHGEF7', 'ANKRD9', 'ARHGEF26', 'ARHGDIA', 'ARHGAP42', 'ARHGAP26', 'ARFGEF1', 'ARF3', 'APEH', 'ANXA6', 'CAVIN3', 'CBFA2T3', 'CBX3', 'DAAM1', 'DNAJA3', 'DNAJA1', 'DNAH11', 'DNAAF5', 'DLD', 'DKK1', 'DKC1', 'DHX38', 'DHX37', 'DGKZ', 'DGKD', 'DERA', 'DDX54', 'DDX23', 'DDIT3', 'DCTN1', 'DCAKD', 'DBT', 'DBNDD1', 'DNAJA4', 'DNAJC21', 'DNMT3A', 'EFNA5', 'ELP3', 'ELOA', 'EIF4G2', 'EIF3J', 'EIF3A', 'EIF2B4', 'EHBP1L1', 'EGLN3', 'EFNA2', 'DTNB', 'ECH1', 'EBAG9', 'DYSF', 'MCM7', 'DVL3', 'DUSP9', 'DUSP5', 'DTYMK', 'DANT1', 'CTXN1', 'CCDC168', 'CSTB', 'CLIC1', 'CLDN4', 'CKS2', 'CITED2', 'CIAO2A', 'CHAC2', 'CFAP97', 'CFAP251', 'CERS2', 'CEP83', 'CEP120', 'CENPB', 'CDC20', 'CD9', 'CD47', 'CD44', 'CCNG2', 'CCDC34', 'CCDC18', 'CLIP2', 'CLSPN', 'CLTB', 'CPTP', 'CSNK2A2', 'CSK', 'CS', 'CRTC2', 'CRNDE', 'CRIP2', 'CREB1', 'CRABP2', 'CPNE2', 'CMIP', 'CPEB4', 'CPEB1', 'COX8A', 'COL6A3', 'CNR2', 'CNOT9', 'CNOT6L', 'CNNM2', 'MCM4', 'MPHOSPH6', 'MDM2', 'SPG21', 'SOX4', 'SOS1', 'SOCS2', 'SNX27', 'SNX24', 'SNX22', 'SNRNP70', 'SNORD3B-1', 'SNHG9', 'SNHG18', 'SMKR1', 'SMIM27', 'SMC6', 'SMC5', 'SMARCB1', 'SLC6A8', 'SLC48A1', 'SLC2A6', 'SLC2A1', 'SPATS2L', 'SPN', 'MED18', 'SPRY1', 'TAOK3', 'TAF9B', 'TAF15', 'TAF13', 'SYTL2', 'SYT14', 'SYNJ2', 'SYNE2', 'SULF2', 'STRIP1', 'STRBP', 'STMN1', 'STARD10', 'SSX2IP', 'SRSF8', 'SRFBP1', 'SREK1IP1P1', 'SRCAP', 'SRA1', 'SLC25A24', 'SLC13A5', 'SLAIN2', 'SINHCAFP3', 'RPS29', 'RPS27', 'RPS21', 'RPLP0P2', 'RPL41', 'RPL39', 'RPL37A', 'RPL34', 'RPL30', 'RPL28', 'RPL27A', 'RPL23', 'RPL17', 'RPL15', 'RPL13', 'RNF25', 'RNF146', 'RNF122', 'RHOT2', 'RRAS', 'RRP1B', 'RRS1', 'SERINC5', 'SIGMAR1', 'SHOX', 'SHISA5', 'SH3RF1', 'SF3B4', 'SETD3', 'SETD2', 'SET', 'SENP6', 'RSRC2', 'SENP3', 'SECISBP2L', 'SCYL2', 'SAT2', 'SART1', 'SAMD4A', 'S100P', 'RTL8C', 'TARS1', 'TATDN2', 'TBC1D9', 'UBA52', 'YTHDF1', 'YKT6', 'XBP1', 'WWC3', 'WSB2', 'WDR77', 'VRK3', 'VPS9D1-AS1', 'VPS45', 'VMP1', 'VIT', 'VEGFB', 'UTP3', 'UTP18', 'USP35', 'USP32', 'UQCR11', 'UQCC2', 'UIMC1', 'YWHAB', 'YWHAZ', 'ZBED2', 'ZNF318', 'ZNF764', 'ZNF703', 'ZNF702P', 'ZNF480', 'ZNF418', 'ZNF354A', 'ZNF33B', 'ZNF326', 'ZNF263', 'ZBED4', 'ZNF202', 'ZMIZ1', 'ZHX1', 'ZFP36', 'ZFC3H1', 'ZBTB7A', 'ZBTB34', 'ZBTB2', 'UBE2Q2', 'TXNRD2', 'TBCA', 'TXN', 'TMEM70', 'TMEM259', 'TMEM258', 'TMEM256', 'TMEM238', 'TIMELESS', 'TIAM1', 'THRB', 'THAP1', 'TGFB3', 'TGDS', 'TFF3', 'TFF1', 'TEDC2-AS1', 'TCHP', 'TCF7L1', 'TCF20', 'TCEAL9', 'TBKBP1', 'TMEM80', 'TMSB4XP4', 'TNFRSF12A', 'TRAK2', 'TUBB6', 'TUBA1B', 'TTL', 'TSR1', 'TSPYL1', 'TSHZ2', 'TRIM52-AS1', 'TRIM37', 'TRAK1', 'TNFSF13B', 'TRAF3IP2', 'TPM4', 'TPM1', 'TPI1', 'TOLLIP', 'TOB1', 'TNNT1', 'TNIP2', 'RHBDD2', 'RGS10', 'RFK', 'MTND2P28', 'NCOA5', 'NCOA1', 'NCLN', 'NCL', 'NCKAP1', 'NCK1', 'NCDN', 'NCBP3', 'NCAM1', 'NBEAL2', 'NAXD', 'NACC1', 'NACA4P', 'NAA10', 'MYO5C', 'MYO10', 'MYH14', 'MXRA5', 'MXI1', 'NDRG1', 'NDUFA8', 'NDUFB4', 'NOL4L', 'NUP93', 'NSD1', 'NRG4', 'NR4A1', 'NQO1', 'NPLOC4', 'NOP10', 'NOM1', 'NME1-NME2', 'NDUFC1', 'NMD3', 'NLK', 'NINJ1', 'NFIC', 'NEUROD2', 'NEDD9', 'NEDD1', 'NEAT1', 'MUL1', 'MTND1P23', 'OPTN', 'MTA2', 'MRPS2', 'MRPL55', 'MPHOSPH9', 'ZNRF1', 'MPDU1', 'MNS1', 'MMP2', 'MMP1', 'MLLT6', 'MLLT3', 'MKNK1', 'MIXL1', 'MIR663AHG', 'MIR210HG', 'MIOS-DT', 'MGRN1', 'MGLL', 'METTL26', 'MELTF-AS1', 'MSMB', 'MSR1', 'MT-ATP8', 'MT-TM', 'MT2A', 'MT1X', 'MT1E', 'MT-TY', 'MT-TV', 'MT-TS2', 'MT-TP', 'MT-TN', 'MT-TL1', 'MT-CO1', 'MT-TE', 'MT-TD', 'MT-RNR2', 'MT-RNR1', 'MT-ND5', 'MT-ND4', 'MT-ND3', 'MT-ND2', 'NUPR2', 'OTUD7B', 'RCC1L', 'PPIG', 'PSMD5', 'PSMD2', 'PSMD14', 'PSMA7', 'PSIP1', 'PSAP', 'PRXL2C', 'PRRC2C', 'PRRC2A', 'PRR5L', 'PRR34-AS1', 'PRR12', 'PRMT6', 'PREX1', 'PRDX1', 'PPTC7', 'PPP4R2', 'PPP1R12B', 'PPM1G', 'PSME4', 'PSMG1', 'PTGR1', 'RAB35', 'RBSN', 'RBBP6', 'RAPGEF3', 'RAI14', 'RAD23A', 'RABEP1', 'RAB5C', 'RAB3GAP1', 'RAB30', 'PTP4A2', 'RAB2B', 'RAB27A', 'RAB1B', 'RAB12', 'RAB11FIP4', 'QSOX1', 'PYGO2', 'PUSL1', 'PPIL1', 'POLR3GL', 'OVOL1', 'POLE4', 'PGAM5', 'PGAM1', 'PFKFB3', 'PFDN4', 'PDS5A', 'PDLIM1', 'PDCD4', 'PDAP1', 'PCYT1A', 'PCDHGA10', 'PCDHB1', 'PATL1', 'PARD6B', 'PAQR8', 'PAQR7', 'PAPOLA', 'PAK2', 'PACS1', 'P4HA1', 'PHACTR1', 'PHC1', 'PHF20L1', 'PLCB4', 'POLDIP2', 'POLB', 'PMEPA1', 'PLOD2', 'PLEC', 'PLD1', 'PLCE1', 'PLCD3', 'PLBD2', 'PHLDA2', 'PLAU', 'PKM', 'PKIB', 'PITX1', 'PITPNA', 'PICALM', 'PI4KB', 'PHRF1', 'AAMP']

In [ ]:
random_forest_feature_lists = [ss_mcf7_random_forest_features, ss_hcc_random_forest_features, ds_mcf7_random_forest_features, ds_hcc_random_forest_features]
top_random_forest_features_occurrences = count_and_sort_occurrences(random_forest_feature_lists, False)
top_random_forest_features = {}

for i in range(4, 0, -1):
    top_random_forest_features[i] = top_random_forest_features.get(i + 1, []) + filter_by_occurrences(top_random_forest_features_occurrences, i)

for i in range(4, 0, -1):
    if len(top_random_forest_features[i]) == 0:
        continue
    print(f"{len(top_random_forest_features[i])} gene(s) selected for random forest across {i}+ data set(s):")
    print(top_random_forest_features[i])
    print()
1 gene(s) selected for random forest across 4+ data set(s):
['PGK1']

8 gene(s) selected for random forest across 3+ data set(s):
['PGK1', 'NDRG1', 'BNIP3', 'DSP', 'P4HA1', 'BNIP3L', 'KRT19', 'GAPDH']

51 gene(s) selected for random forest across 2+ data set(s):
['PGK1', 'NDRG1', 'BNIP3', 'DSP', 'P4HA1', 'BNIP3L', 'KRT19', 'GAPDH', 'RPS28', 'RPS27', 'RPS19', 'FUT11', 'RPS5', 'S100A10', 'S100A11', 'RPS14', 'GPI', 'RPLP2', 'RPLP1', 'RPL37A', 'RPL39', 'H4C3', 'RPL36', 'PFKFB3', 'RPL35', 'DSCAM-AS1', 'PKM', 'FGF23', 'RPL13', 'RPL12', 'EGLN3', 'ELOB', 'ENO1', 'ENO2', 'ERO1A', 'SERF2', 'FAM162A', 'MT-CO3', 'MT-CYB', 'TPI1', 'HES1', 'TMSB10', 'LDHA', 'MT-ATP6', 'BCYRN1', 'MT-CO2', 'ALDOA', 'MALAT1', 'ADM', 'UQCRQ', 'MT-RNR2']

242 gene(s) selected for random forest across 1+ data set(s):
['PGK1', 'NDRG1', 'BNIP3', 'DSP', 'P4HA1', 'BNIP3L', 'KRT19', 'GAPDH', 'RPS28', 'RPS27', 'RPS19', 'FUT11', 'RPS5', 'S100A10', 'S100A11', 'RPS14', 'GPI', 'RPLP2', 'RPLP1', 'RPL37A', 'RPL39', 'H4C3', 'RPL36', 'PFKFB3', 'RPL35', 'DSCAM-AS1', 'PKM', 'FGF23', 'RPL13', 'RPL12', 'EGLN3', 'ELOB', 'ENO1', 'ENO2', 'ERO1A', 'SERF2', 'FAM162A', 'MT-CO3', 'MT-CYB', 'TPI1', 'HES1', 'TMSB10', 'LDHA', 'MT-ATP6', 'BCYRN1', 'MT-CO2', 'ALDOA', 'MALAT1', 'ADM', 'UQCRQ', 'MT-RNR2', 'FDPS', 'INSIG1', 'HSPB1', 'FOSL1', 'IGFBP3', 'HSPD1', 'FAM83A', 'IFITM3', 'HSPH1', 'FASN', 'FDFT1', 'HSPA5', 'IFITM2', 'H2AC12', 'HSP90AB1', 'FOSL2', 'HSP90B1', 'H1-3', 'HILPDA', 'FAM13A', 'H1-1', 'HK2', 'GYS1', 'HMGA1', 'HNRNPA2B1', 'GSTP1', 'GPRC5A', 'GPM6A', 'HNRNPM', 'HNRNPU', 'HSP90AA1', 'FTL', 'ZNF473', 'DYNC2I2', 'EZR', 'EMP2', 'C19orf53', 'BUB1B', 'BUB1', 'BTBD9', 'BLCAP', 'BHLHE40', 'BAP1', 'B4GALT1', 'ATP5MK', 'ATP5MG', 'ATP5ME', 'ATP5F1E', 'ATAD2', 'ASB2', 'ARRDC3', 'APEH', 'ANGPTL4', 'ALDOC', 'AKR1C2', 'AKR1C1', 'AHNAK2', 'ACTB', 'ACLY', 'C1orf116', 'C4orf3', 'CA9', 'CYP1B1', 'EIF5', 'EIF3J', 'EGLN1', 'EEF2', 'EEF1A1', 'EBP', 'KCNJ3', 'DNMT1', 'DDIT4', 'CYP1B1-AS1', 'CYB561A3', 'CACNA1A', 'COX7C', 'COX7A2', 'CNNM2', 'CENPF', 'CDKN1A', 'CBX3', 'CAV1', 'CAST', 'CALM2', 'CALB1', 'IRF2BP2', 'MOV10', 'KCTD11', 'KDM3A', 'SNRPD2', 'SNRNP25', 'SLC9A3R1', 'SLC6A8', 'SLC3A2', 'SLC2A1', 'SET', 'S100A6', 'RPSA', 'RPS8', 'RPS3', 'RPS2', 'RPS16', 'RPS15A', 'RPS15', 'RPS12', 'RPL8', 'RPL41', 'RPL37', 'RPL35A', 'RPL34', 'RPL30', 'RPL28', 'SNX33', 'SOX4', 'SQLE', 'TRIM44', 'ZC3H15', 'YWHAZ', 'WDR43', 'VEGFA', 'UPK1B', 'UBC', 'UBA52', 'TUBG1', 'TUBD1', 'TST', 'TPX2', 'SRM', 'TPT1', 'TPBG', 'TOB1', 'TMSB4X', 'TMEM64', 'TMEM45A', 'TMEM258', 'TFF3', 'TFF1', 'STMN1', 'RPL27A', 'RPL23', 'RPL21', 'MT-CO1', 'NEAT1', 'NDUFB2', 'NCL', 'NCALD', 'MTATP6P1', 'MT2A', 'MT-TQ', 'MT-RNR1', 'MT-ND4', 'MT-ND3', 'ZNF302', 'NPM1P40', 'MOB3A', 'MIF', 'MARCKS', 'LOXL2', 'LGALS1', 'LDHB', 'LBH', 'KYNU', 'KRT8', 'KRT18', 'NECTIN2', 'P4HA2', 'RPL15', 'PRRG3', 'RPL11', 'RPL10', 'ROMO1', 'RALGDS', 'RAC1', 'PYCR3', 'PTMS', 'PSME2', 'PSMA7', 'PRSS8', 'PPP1R3G', 'PABPC1', 'POLR2L', 'PLOD2', 'PLIN2', 'PLEC', 'PLAC8', 'PFKP', 'PFKFB4', 'PDK1', 'PDIA3', 'PARD6B', 'ACAT2']

Looking at the selected genes across all the models and data sets, certain genes that tend to have a high predictive power can be identified. The selected genes with higher occurrences may be especially useful for constructing a generalized model.

In [ ]:
feature_lists = logit_feature_lists + svm_feature_lists + random_forest_feature_lists
top_genes_occurrences = count_and_sort_occurrences(feature_lists, False)
top_genes = {}

for i in range(12, 0, -1):
    top_genes[i] = top_genes.get(i + 1, []) + filter_by_occurrences(top_genes_occurrences, i)

for i in range(12, 0, -1):
    if len(top_genes[i]) == 0:
        continue
    print(f"{len(top_genes[i])} gene(s) selected {i}+ times:")
    print(top_genes[i])
    print()
1 gene(s) selected 12+ times:
['PGK1']

1 gene(s) selected 11+ times:
['PGK1']

2 gene(s) selected 10+ times:
['PGK1', 'MT-CYB']

3 gene(s) selected 9+ times:
['PGK1', 'MT-CYB', 'MT-CO3']

7 gene(s) selected 8+ times:
['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA']

10 gene(s) selected 7+ times:
['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4']

20 gene(s) selected 6+ times:
['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A']

45 gene(s) selected 5+ times:
['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A', 'AKR1C2', 'MT-TQ', 'H2AC12', 'KCNJ3', 'FUT11', 'CAV1', 'BAP1', 'TRIM44', 'CACNA1A', 'WDR43', 'ADM', 'NPM1P40', 'LGALS1', 'PFKFB3', 'EGLN3', 'GPI', 'EMP2', 'ZNF302', 'HES1', 'NCALD', 'GPM6A', 'BTBD9', 'ZC3H15', 'ALDOA', 'PRRG3']

97 gene(s) selected 4+ times:
['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A', 'AKR1C2', 'MT-TQ', 'H2AC12', 'KCNJ3', 'FUT11', 'CAV1', 'BAP1', 'TRIM44', 'CACNA1A', 'WDR43', 'ADM', 'NPM1P40', 'LGALS1', 'PFKFB3', 'EGLN3', 'GPI', 'EMP2', 'ZNF302', 'HES1', 'NCALD', 'GPM6A', 'BTBD9', 'ZC3H15', 'ALDOA', 'PRRG3', 'CAMK2N1', 'BCYRN1', 'GATA3', 'RPSAP48', 'GOLGA4', 'PROSER1', 'BMPR1B', 'FOSL2', 'CAMSAP2', 'RPS6KA6', 'FEM1A', 'RGPD4-AS1', 'RHOD', 'RPL13', 'TPI1', 'TPD52L1', 'BTN3A2', 'RPS27', 'ENO1', 'C4orf3', 'TMEM64', 'H19', 'HSPH1', 'H2AC11', 'NEDD4L', 'MT-ND1', 'ZNF688', 'MT-ND4L', 'MT-ND6', 'MT-RNR1', 'MT-RNR2', 'MT-TA', 'AKR1C1', 'AKT1S1', 'ALDOC', 'MT-TS1', 'ATXN2L', 'YTHDF3', 'MAFF', 'APOOL', 'KCNJ2', 'PLOD2', 'HSP90AA1', 'RPL39', 'ARMC6', 'SLC6A8', 'IGFBP3', 'PKM', 'SLC25A48', 'KMT2D', 'SLC2A1', 'RPL37A']

155 gene(s) selected 3+ times:
['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A', 'AKR1C2', 'MT-TQ', 'H2AC12', 'KCNJ3', 'FUT11', 'CAV1', 'BAP1', 'TRIM44', 'CACNA1A', 'WDR43', 'ADM', 'NPM1P40', 'LGALS1', 'PFKFB3', 'EGLN3', 'GPI', 'EMP2', 'ZNF302', 'HES1', 'NCALD', 'GPM6A', 'BTBD9', 'ZC3H15', 'ALDOA', 'PRRG3', 'CAMK2N1', 'BCYRN1', 'GATA3', 'RPSAP48', 'GOLGA4', 'PROSER1', 'BMPR1B', 'FOSL2', 'CAMSAP2', 'RPS6KA6', 'FEM1A', 'RGPD4-AS1', 'RHOD', 'RPL13', 'TPI1', 'TPD52L1', 'BTN3A2', 'RPS27', 'ENO1', 'C4orf3', 'TMEM64', 'H19', 'HSPH1', 'H2AC11', 'NEDD4L', 'MT-ND1', 'ZNF688', 'MT-ND4L', 'MT-ND6', 'MT-RNR1', 'MT-RNR2', 'MT-TA', 'AKR1C1', 'AKT1S1', 'ALDOC', 'MT-TS1', 'ATXN2L', 'YTHDF3', 'MAFF', 'APOOL', 'KCNJ2', 'PLOD2', 'HSP90AA1', 'RPL39', 'ARMC6', 'SLC6A8', 'IGFBP3', 'PKM', 'SLC25A48', 'KMT2D', 'SLC2A1', 'RPL37A', 'CNNM2', 'SOX4', 'SET', 'CSTB', 'CLDN4', 'CKS2', 'RPL41', 'EIF3J', 'ERO1A', 'RPL34', 'HSPA5', 'MT-ND3', 'MT-ND4', 'MT2A', 'MARCKS', 'NCL', 'NEAT1', 'KRT4', 'KPNA2', 'PARD6B', 'KDM3A', 'HSPD1', 'PLEC', 'RPL30', 'HSP90AB1', 'HEPACAM', 'H4C3', 'STMN1', 'GYS1', 'PSMA7', 'FOSL1', 'RPL15', 'RPL23', 'RPL27A', 'RPL28', 'STC2', 'MT-CO1', 'ANGPTL4', 'TMEM258', 'YWHAZ', 'UBA52', 'C19orf53', 'ATP5F1E', 'APEH', 'TOB1', 'TFF3', 'AMOTL2', 'CBX3', 'ZBTB20', 'AHNAK2', 'AURKA', 'TWNK', 'ACTB', 'B4GALT1', 'CALM2', 'TFF1', 'CAST', 'TUBA1B']

784 gene(s) selected 2+ times:
['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A', 'AKR1C2', 'MT-TQ', 'H2AC12', 'KCNJ3', 'FUT11', 'CAV1', 'BAP1', 'TRIM44', 'CACNA1A', 'WDR43', 'ADM', 'NPM1P40', 'LGALS1', 'PFKFB3', 'EGLN3', 'GPI', 'EMP2', 'ZNF302', 'HES1', 'NCALD', 'GPM6A', 'BTBD9', 'ZC3H15', 'ALDOA', 'PRRG3', 'CAMK2N1', 'BCYRN1', 'GATA3', 'RPSAP48', 'GOLGA4', 'PROSER1', 'BMPR1B', 'FOSL2', 'CAMSAP2', 'RPS6KA6', 'FEM1A', 'RGPD4-AS1', 'RHOD', 'RPL13', 'TPI1', 'TPD52L1', 'BTN3A2', 'RPS27', 'ENO1', 'C4orf3', 'TMEM64', 'H19', 'HSPH1', 'H2AC11', 'NEDD4L', 'MT-ND1', 'ZNF688', 'MT-ND4L', 'MT-ND6', 'MT-RNR1', 'MT-RNR2', 'MT-TA', 'AKR1C1', 'AKT1S1', 'ALDOC', 'MT-TS1', 'ATXN2L', 'YTHDF3', 'MAFF', 'APOOL', 'KCNJ2', 'PLOD2', 'HSP90AA1', 'RPL39', 'ARMC6', 'SLC6A8', 'IGFBP3', 'PKM', 'SLC25A48', 'KMT2D', 'SLC2A1', 'RPL37A', 'CNNM2', 'SOX4', 'SET', 'CSTB', 'CLDN4', 'CKS2', 'RPL41', 'EIF3J', 'ERO1A', 'RPL34', 'HSPA5', 'MT-ND3', 'MT-ND4', 'MT2A', 'MARCKS', 'NCL', 'NEAT1', 'KRT4', 'KPNA2', 'PARD6B', 'KDM3A', 'HSPD1', 'PLEC', 'RPL30', 'HSP90AB1', 'HEPACAM', 'H4C3', 'STMN1', 'GYS1', 'PSMA7', 'FOSL1', 'RPL15', 'RPL23', 'RPL27A', 'RPL28', 'STC2', 'MT-CO1', 'ANGPTL4', 'TMEM258', 'YWHAZ', 'UBA52', 'C19orf53', 'ATP5F1E', 'APEH', 'TOB1', 'TFF3', 'AMOTL2', 'CBX3', 'ZBTB20', 'AHNAK2', 'AURKA', 'TWNK', 'ACTB', 'B4GALT1', 'CALM2', 'TFF1', 'CAST', 'TUBA1B', 'IVL', 'IWS1', 'ARIH1', 'IRF2BPL', 'ITPK1', 'ISOC2', 'ISCU', 'IRAK1', 'INSIG1', 'INPP4B', 'JAKMIP3', 'ARID1B', 'JUN', 'KIF5B', 'KLLN', 'KLHL8', 'KLC2', 'KLC1', 'KITLG', 'KIRREL1', 'ARHGEF26', 'KIF14', 'JUND', 'KHSRP', 'KEAP1', 'ARHGEF7', 'KCTD11', 'KCNQ1OT1', 'ARID5B', 'KAT7', 'INHBA', 'CDKN1A', 'ING2', 'HMGB2', 'HIF3A', 'HEY1', 'HES4', 'ATRX', 'HELQ', 'HCFC1', 'H4C5', 'ATXN1L', 'H3C2', 'H2BC9', 'H2BC4', 'H2AC20', 'H2AC16', 'AXL', 'BAZ2A', 'GSE1', 'GRK2', 'HILPDA', 'HNRNPA2B1', 'INF2', 'HOXC13', 'INCENP', 'ILRUN', 'IGFBP5', 'ARHGAP42', 'ARL13B', 'ARL2', 'IFITM3', 'IFI27L2', 'ARNTL', 'ARNTL2', 'HSPA8', 'ARPP19', 'HSP90B1', 'ARSA', 'ARTN', 'ATF5', 'ATN1', 'ARHGDIA', 'ARF3', 'ARHGAP26', 'MARK3', 'MLLT6', 'MLLT3', 'MKNK1', 'MIXL1', 'MIR663AHG', 'MIR210HG', 'MIOS-DT', 'MIF', 'MGRN1', 'MGLL', 'METTL26', 'MELTF-AS1', 'MED18', 'MDM2', 'MCM4', 'MCM3AP', 'MB', 'MMP1', 'AMFR', 'AKR1C3', 'MSMB', 'ABL1', 'ABO', 'ACAT2', 'ACSL4', 'MSR1', 'ADARB1', 'AFF1', 'AJAP1', 'MMP2', 'AK4', 'AKAP5', 'MRPL55', 'MPHOSPH9', 'MPHOSPH6', 'MPDU1', 'ZRANB1', 'MAZ', 'MARK2', 'KPNA4', 'MAPKAPK2', 'LINC01291', 'LINC01133', 'LINC01116', 'LIMCH1', 'ANKRD9', 'LETM1', 'LDLRAP1', 'ANXA6', 'LDHB', 'LCLAT1', 'LAD1', 'KYNU', 'KRT80', 'KRT8', 'BBOF1', 'ARFGEF1', 'KRT18', 'LINC01304', 'LINC01902', 'LINC02367', 'LXN', 'MAP3K13', 'MAP2K3', 'MALAT1', 'ANKEF1', 'MAD2L1', 'LYAR', 'ANKRD17', 'LTBR', 'LINC02511', 'LRRFIP2', 'LPP', 'ANKRD40', 'LOXL2', 'LMNB2', 'ANKRD52', 'LINC02541', 'GREM1', 'BCL3', 'GPRC5A', 'DNAJA3', 'DNAH11', 'DNAAF5', 'DLD', 'DKK1', 'DKC1', 'DHX38', 'DHX37', 'CAVIN3', 'DHCR7', 'DGKZ', 'DGKD', 'DERA', 'DDX54', 'CBFA2T3', 'DDX23', 'DDIT3', 'DCTN1', 'DNAJA1', 'DNAJA4', 'DBT', 'DNAJC21', 'CAPZA1', 'CARM1', 'CASP8AP2', 'EGLN1', 'EFNA5', 'EFNA2', 'ECH1', 'EBAG9', 'DYSF', 'DYNC2I2', 'DVL3', 'DUSP9', 'DUSP5', 'DTYMK', 'DTNB', 'DSCAM-AS1', 'DNMT3A', 'DCAKD', 'DBNDD1', 'EIF2B4', 'CNOT6L', 'CMIP', 'CLTB', 'CLSPN', 'CLIP2', 'CLIC1', 'CCNG2', 'CD44', 'CITED2', 'CD47', 'CDC20', 'CIAO2A', 'CHAC2', 'CFAP97', 'CFAP251', 'CERS2', 'CEP83', 'CEP120', 'CCDC34', 'CNOT9', 'DANT1', 'CNR2', 'DAAM1', 'CYP1B1', 'CTXN1', 'CCDC168', 'CSNK2A2', 'CSK', 'CS', 'CRTC2', 'CRNDE', 'CRIP2', 'CREB1', 'CCDC18', 'CPTP', 'CPNE2', 'CPEB1', 'COX8A', 'COL6A3', 'EHBP1L1', 'EIF3A', 'BCAS3', 'GAB2', 'BLOC1S3', 'FTL', 'FSD1L', 'FRY', 'FRS2', 'BMS1', 'BOLA3', 'FOS', 'FLOT2', 'FLNA', 'FGF8', 'BRAT1', 'BRIP1', 'FDFT1', 'BRMS1', 'FBXL18', 'FBXL17', 'FYN', 'GABPB2', 'FBXL16', 'GABRE', 'GPATCH4', 'CENPB', 'BHLHE40', 'GOLGA3', 'GNAQ', 'GLE1', 'GJB3', 'GIN1', 'GDPGP1', 'GDI1', 'GDF15', 'GDAP2', 'GCAT', 'GBP1P1', 'GATAD2A', 'BICDL1', 'BLCAP', 'BRPF3', 'FBRS', 'CAP1', 'FAH', 'ETF1', 'ESRP2', 'MT-ATP8', 'EPPK1', 'EPHX1', 'ENTR1', 'ENOX2', 'ENO2', 'ENKD1', 'CACNB2', 'CACNG4', 'ELP3', 'ELOB', 'ELOA', 'CALHM2', 'EIF5', 'EIF4G2', 'EXOC7', 'CACHD1', 'FASTKD5', 'CA9', 'FARP1', 'FAM83A', 'FAM50A', 'FAM189B', 'FAM177A1', 'BTBD7P1', 'C16orf91', 'FAM13B', 'C1orf53', 'C2orf49', 'C6orf62', 'C7orf50', 'FAM126B', 'FAM111B', 'FAM104A', 'FAM102A', 'C9orf78', 'MNS1', 'AAMP', 'SYT14', 'RAB27A', 'TSHZ2', 'RAB3GAP1', 'SNRNP70', 'RAB35', 'RAB30', 'TBC1D9', 'RAB2B', 'RAB1B', 'TRAF3IP2', 'RAB12', 'RAB11FIP4', 'TSPYL1', 'PYGO2', 'TSR1', 'SNX22', 'PTP4A2', 'RAB5C', 'RABEP1', 'TRIM52-AS1', 'RAD23A', 'RNF122', 'RHOT2', 'RHBDD2', 'RGS10', 'RFK', 'TRAK1', 'RCC1L', 'TRAK2', 'RBSN', 'RBBP6', 'TBKBP1', 'RAPGEF3', 'TBCA', 'TRIM37', 'RAI14', 'SNX24', 'PTGR1', 'PSMG1', 'SPG21', 'PRR12', 'SPN', 'PRMT6', 'PREX1', 'PRDX1', 'TARS1', 'TAOK3', 'PPTC7', 'PPP4R2', 'SPRY1', 'PPP1R12B', 'TAF9B', 'PPM1G', 'PPIL1', 'PPIG', 'TXNRD2', 'TXN', 'TTL', 'PRR34-AS1', 'PSME4', 'SNX27', 'PSMD5', 'PSMD2', 'PSMD14', 'PSIP1', 'SOCS2', 'PRXL2C', 'TUBB6', 'SOS1', 'SPATS2L', 'PRRC2C', 'PRRC2A', 'PRR5L', 'TATDN2', 'RNF146', 'RNF25', 'TCHP', 'SCYL2', 'SETD2', 'SERINC5', 'SERF2', 'SENP6', 'THRB', 'TIAM1', 'SECISBP2L', 'TIMELESS', 'TPX2', 'SAT2', 'SART1', 'SMC6', 'SAMD4A', 'TMEM238', 'TMEM256', 'S100P', 'TCF7L1', 'SETD3', 'SF3B4', 'SH3RF1', 'TEDC2-AS1', 'SLC48A1', 'SMARCB1', 'TGDS', 'TGFB3', 'SLC2A6', 'THAP1', 'SLC25A24', 'SMC5', 'SLC13A5', 'SLAIN2', 'SINHCAFP3', 'SIGMAR1', 'SHOX', 'SHISA5', 'TMEM259', 'SMIM27', 'RTL8C', 'SNHG18', 'RPLP1', 'RPLP0P2', 'TNNT1', 'TOLLIP', 'SNHG9', 'TCF20', 'RPL36', 'SNORD3B-1', 'RPL35', 'TPBG', 'TPM1', 'RPL17', 'RPL12', 'TPM4', 'TCEAL9', 'RPLP2', 'RPS14', 'RSRC2', 'TNIP2', 'RRS1', 'RRP1B', 'RRAS', 'TMEM70', 'TMEM80', 'SMKR1', 'RPS5', 'TMSB4XP4', 'RPS29', 'RPS28', 'RPS21', 'TNFRSF12A', 'RPS19', 'TNFSF13B', 'MT-ND2', 'SRA1', 'SRCAP', 'POLR3GL', 'NDUFB4', 'NFIC', 'NEUROD2', 'NEDD9', 'ZNF263', 'NEDD1', 'ZNF318', 'NDUFC1', 'ZNF326', 'UBE2Q2', 'NDUFA8', 'ZNF33B', 'NCOA5', 'NCOA1', 'NCLN', 'NCKAP1', 'NCK1', 'NINJ1', 'NLK', 'NMD3', 'NME1-NME2', 'NUP93', 'ZBTB7A', 'ZFC3H1', 'NSD1', 'TAF13', 'ZFP36', 'ZHX1', 'NRG4', 'ZMIZ1', 'NQO1', 'NPLOC4', 'NOP10', 'NOM1', 'ZNF202', 'NOL4L', 'NCDN', 'NCBP3', 'NCAM1', 'MT1X', 'MT-TY', 'MT-TV', 'MT-TS2', 'MT-TP', 'MT-TN', 'ZNF702P', 'SYNJ2', 'MT-TM', 'MT-TL1', 'MT-TE', 'ZNF703', 'MT-TD', 'ZNF764', 'MT-ND5', 'ZNRF1', 'MT1E', 'MTA2', 'NBEAL2', 'ZNF480', 'NAXD', 'NACC1', 'NACA4P', 'NAA10', 'MYO5C', 'ZNF354A', 'MYO10', 'MYH14', 'ZNF418', 'MXRA5', 'MXI1', 'SULF2', 'MTND2P28', 'SYNE2', 'MTND1P23', 'NUPR2', 'OPTN', 'OTUD7B', 'PLCD3', 'PLCB4', 'PLBD2', 'USP35', 'UTP18', 'PLAU', 'UTP3', 'STARD10', 'PKIB', 'PITX1', 'PITPNA', 'PICALM', 'PI4KB', 'PHRF1', 'PHLDA2', 'PHF20L1', 'USP32', 'PLCE1', 'PHACTR1', 'PLD1', 'SREK1IP1P1', 'SRFBP1', 'SRM', 'UIMC1', 'POLE4', 'POLDIP2', 'UPK1B', 'POLB', 'TAF15', 'UQCC2', 'PMEPA1', 'SSX2IP', 'UQCR11', 'UQCRQ', 'PLIN2', 'STRBP', 'VEGFB', 'OVOL1', 'PCYT1A', 'YTHDF1', 'PCDHGA10', 'PCDHB1', 'YWHAB', 'ZBED2', 'ZBED4', 'PATL1', 'PAQR8', 'PAQR7', 'PAPOLA', 'PAK2', 'PACS1', 'ZBTB2', 'P4HA2', 'ZBTB34', 'YKT6', 'PDAP1', 'PFDN4', 'PDCD4', 'VIT', 'PGAM5', 'VMP1', 'PGAM1', 'VPS45', 'VPS9D1-AS1', 'VRK3', 'SLC9A3R1', 'WDR77', 'PDS5A', 'WSB2', 'PDLIM1', 'STRIP1', 'WWC3', 'XBP1']

979 gene(s) selected 1+ times:
['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A', 'AKR1C2', 'MT-TQ', 'H2AC12', 'KCNJ3', 'FUT11', 'CAV1', 'BAP1', 'TRIM44', 'CACNA1A', 'WDR43', 'ADM', 'NPM1P40', 'LGALS1', 'PFKFB3', 'EGLN3', 'GPI', 'EMP2', 'ZNF302', 'HES1', 'NCALD', 'GPM6A', 'BTBD9', 'ZC3H15', 'ALDOA', 'PRRG3', 'CAMK2N1', 'BCYRN1', 'GATA3', 'RPSAP48', 'GOLGA4', 'PROSER1', 'BMPR1B', 'FOSL2', 'CAMSAP2', 'RPS6KA6', 'FEM1A', 'RGPD4-AS1', 'RHOD', 'RPL13', 'TPI1', 'TPD52L1', 'BTN3A2', 'RPS27', 'ENO1', 'C4orf3', 'TMEM64', 'H19', 'HSPH1', 'H2AC11', 'NEDD4L', 'MT-ND1', 'ZNF688', 'MT-ND4L', 'MT-ND6', 'MT-RNR1', 'MT-RNR2', 'MT-TA', 'AKR1C1', 'AKT1S1', 'ALDOC', 'MT-TS1', 'ATXN2L', 'YTHDF3', 'MAFF', 'APOOL', 'KCNJ2', 'PLOD2', 'HSP90AA1', 'RPL39', 'ARMC6', 'SLC6A8', 'IGFBP3', 'PKM', 'SLC25A48', 'KMT2D', 'SLC2A1', 'RPL37A', 'CNNM2', 'SOX4', 'SET', 'CSTB', 'CLDN4', 'CKS2', 'RPL41', 'EIF3J', 'ERO1A', 'RPL34', 'HSPA5', 'MT-ND3', 'MT-ND4', 'MT2A', 'MARCKS', 'NCL', 'NEAT1', 'KRT4', 'KPNA2', 'PARD6B', 'KDM3A', 'HSPD1', 'PLEC', 'RPL30', 'HSP90AB1', 'HEPACAM', 'H4C3', 'STMN1', 'GYS1', 'PSMA7', 'FOSL1', 'RPL15', 'RPL23', 'RPL27A', 'RPL28', 'STC2', 'MT-CO1', 'ANGPTL4', 'TMEM258', 'YWHAZ', 'UBA52', 'C19orf53', 'ATP5F1E', 'APEH', 'TOB1', 'TFF3', 'AMOTL2', 'CBX3', 'ZBTB20', 'AHNAK2', 'AURKA', 'TWNK', 'ACTB', 'B4GALT1', 'CALM2', 'TFF1', 'CAST', 'TUBA1B', 'IVL', 'IWS1', 'ARIH1', 'IRF2BPL', 'ITPK1', 'ISOC2', 'ISCU', 'IRAK1', 'INSIG1', 'INPP4B', 'JAKMIP3', 'ARID1B', 'JUN', 'KIF5B', 'KLLN', 'KLHL8', 'KLC2', 'KLC1', 'KITLG', 'KIRREL1', 'ARHGEF26', 'KIF14', 'JUND', 'KHSRP', 'KEAP1', 'ARHGEF7', 'KCTD11', 'KCNQ1OT1', 'ARID5B', 'KAT7', 'INHBA', 'CDKN1A', 'ING2', 'HMGB2', 'HIF3A', 'HEY1', 'HES4', 'ATRX', 'HELQ', 'HCFC1', 'H4C5', 'ATXN1L', 'H3C2', 'H2BC9', 'H2BC4', 'H2AC20', 'H2AC16', 'AXL', 'BAZ2A', 'GSE1', 'GRK2', 'HILPDA', 'HNRNPA2B1', 'INF2', 'HOXC13', 'INCENP', 'ILRUN', 'IGFBP5', 'ARHGAP42', 'ARL13B', 'ARL2', 'IFITM3', 'IFI27L2', 'ARNTL', 'ARNTL2', 'HSPA8', 'ARPP19', 'HSP90B1', 'ARSA', 'ARTN', 'ATF5', 'ATN1', 'ARHGDIA', 'ARF3', 'ARHGAP26', 'MARK3', 'MLLT6', 'MLLT3', 'MKNK1', 'MIXL1', 'MIR663AHG', 'MIR210HG', 'MIOS-DT', 'MIF', 'MGRN1', 'MGLL', 'METTL26', 'MELTF-AS1', 'MED18', 'MDM2', 'MCM4', 'MCM3AP', 'MB', 'MMP1', 'AMFR', 'AKR1C3', 'MSMB', 'ABL1', 'ABO', 'ACAT2', 'ACSL4', 'MSR1', 'ADARB1', 'AFF1', 'AJAP1', 'MMP2', 'AK4', 'AKAP5', 'MRPL55', 'MPHOSPH9', 'MPHOSPH6', 'MPDU1', 'ZRANB1', 'MAZ', 'MARK2', 'KPNA4', 'MAPKAPK2', 'LINC01291', 'LINC01133', 'LINC01116', 'LIMCH1', 'ANKRD9', 'LETM1', 'LDLRAP1', 'ANXA6', 'LDHB', 'LCLAT1', 'LAD1', 'KYNU', 'KRT80', 'KRT8', 'BBOF1', 'ARFGEF1', 'KRT18', 'LINC01304', 'LINC01902', 'LINC02367', 'LXN', 'MAP3K13', 'MAP2K3', 'MALAT1', 'ANKEF1', 'MAD2L1', 'LYAR', 'ANKRD17', 'LTBR', 'LINC02511', 'LRRFIP2', 'LPP', 'ANKRD40', 'LOXL2', 'LMNB2', 'ANKRD52', 'LINC02541', 'GREM1', 'BCL3', 'GPRC5A', 'DNAJA3', 'DNAH11', 'DNAAF5', 'DLD', 'DKK1', 'DKC1', 'DHX38', 'DHX37', 'CAVIN3', 'DHCR7', 'DGKZ', 'DGKD', 'DERA', 'DDX54', 'CBFA2T3', 'DDX23', 'DDIT3', 'DCTN1', 'DNAJA1', 'DNAJA4', 'DBT', 'DNAJC21', 'CAPZA1', 'CARM1', 'CASP8AP2', 'EGLN1', 'EFNA5', 'EFNA2', 'ECH1', 'EBAG9', 'DYSF', 'DYNC2I2', 'DVL3', 'DUSP9', 'DUSP5', 'DTYMK', 'DTNB', 'DSCAM-AS1', 'DNMT3A', 'DCAKD', 'DBNDD1', 'EIF2B4', 'CNOT6L', 'CMIP', 'CLTB', 'CLSPN', 'CLIP2', 'CLIC1', 'CCNG2', 'CD44', 'CITED2', 'CD47', 'CDC20', 'CIAO2A', 'CHAC2', 'CFAP97', 'CFAP251', 'CERS2', 'CEP83', 'CEP120', 'CCDC34', 'CNOT9', 'DANT1', 'CNR2', 'DAAM1', 'CYP1B1', 'CTXN1', 'CCDC168', 'CSNK2A2', 'CSK', 'CS', 'CRTC2', 'CRNDE', 'CRIP2', 'CREB1', 'CCDC18', 'CPTP', 'CPNE2', 'CPEB1', 'COX8A', 'COL6A3', 'EHBP1L1', 'EIF3A', 'BCAS3', 'GAB2', 'BLOC1S3', 'FTL', 'FSD1L', 'FRY', 'FRS2', 'BMS1', 'BOLA3', 'FOS', 'FLOT2', 'FLNA', 'FGF8', 'BRAT1', 'BRIP1', 'FDFT1', 'BRMS1', 'FBXL18', 'FBXL17', 'FYN', 'GABPB2', 'FBXL16', 'GABRE', 'GPATCH4', 'CENPB', 'BHLHE40', 'GOLGA3', 'GNAQ', 'GLE1', 'GJB3', 'GIN1', 'GDPGP1', 'GDI1', 'GDF15', 'GDAP2', 'GCAT', 'GBP1P1', 'GATAD2A', 'BICDL1', 'BLCAP', 'BRPF3', 'FBRS', 'CAP1', 'FAH', 'ETF1', 'ESRP2', 'MT-ATP8', 'EPPK1', 'EPHX1', 'ENTR1', 'ENOX2', 'ENO2', 'ENKD1', 'CACNB2', 'CACNG4', 'ELP3', 'ELOB', 'ELOA', 'CALHM2', 'EIF5', 'EIF4G2', 'EXOC7', 'CACHD1', 'FASTKD5', 'CA9', 'FARP1', 'FAM83A', 'FAM50A', 'FAM189B', 'FAM177A1', 'BTBD7P1', 'C16orf91', 'FAM13B', 'C1orf53', 'C2orf49', 'C6orf62', 'C7orf50', 'FAM126B', 'FAM111B', 'FAM104A', 'FAM102A', 'C9orf78', 'MNS1', 'AAMP', 'SYT14', 'RAB27A', 'TSHZ2', 'RAB3GAP1', 'SNRNP70', 'RAB35', 'RAB30', 'TBC1D9', 'RAB2B', 'RAB1B', 'TRAF3IP2', 'RAB12', 'RAB11FIP4', 'TSPYL1', 'PYGO2', 'TSR1', 'SNX22', 'PTP4A2', 'RAB5C', 'RABEP1', 'TRIM52-AS1', 'RAD23A', 'RNF122', 'RHOT2', 'RHBDD2', 'RGS10', 'RFK', 'TRAK1', 'RCC1L', 'TRAK2', 'RBSN', 'RBBP6', 'TBKBP1', 'RAPGEF3', 'TBCA', 'TRIM37', 'RAI14', 'SNX24', 'PTGR1', 'PSMG1', 'SPG21', 'PRR12', 'SPN', 'PRMT6', 'PREX1', 'PRDX1', 'TARS1', 'TAOK3', 'PPTC7', 'PPP4R2', 'SPRY1', 'PPP1R12B', 'TAF9B', 'PPM1G', 'PPIL1', 'PPIG', 'TXNRD2', 'TXN', 'TTL', 'PRR34-AS1', 'PSME4', 'SNX27', 'PSMD5', 'PSMD2', 'PSMD14', 'PSIP1', 'SOCS2', 'PRXL2C', 'TUBB6', 'SOS1', 'SPATS2L', 'PRRC2C', 'PRRC2A', 'PRR5L', 'TATDN2', 'RNF146', 'RNF25', 'TCHP', 'SCYL2', 'SETD2', 'SERINC5', 'SERF2', 'SENP6', 'THRB', 'TIAM1', 'SECISBP2L', 'TIMELESS', 'TPX2', 'SAT2', 'SART1', 'SMC6', 'SAMD4A', 'TMEM238', 'TMEM256', 'S100P', 'TCF7L1', 'SETD3', 'SF3B4', 'SH3RF1', 'TEDC2-AS1', 'SLC48A1', 'SMARCB1', 'TGDS', 'TGFB3', 'SLC2A6', 'THAP1', 'SLC25A24', 'SMC5', 'SLC13A5', 'SLAIN2', 'SINHCAFP3', 'SIGMAR1', 'SHOX', 'SHISA5', 'TMEM259', 'SMIM27', 'RTL8C', 'SNHG18', 'RPLP1', 'RPLP0P2', 'TNNT1', 'TOLLIP', 'SNHG9', 'TCF20', 'RPL36', 'SNORD3B-1', 'RPL35', 'TPBG', 'TPM1', 'RPL17', 'RPL12', 'TPM4', 'TCEAL9', 'RPLP2', 'RPS14', 'RSRC2', 'TNIP2', 'RRS1', 'RRP1B', 'RRAS', 'TMEM70', 'TMEM80', 'SMKR1', 'RPS5', 'TMSB4XP4', 'RPS29', 'RPS28', 'RPS21', 'TNFRSF12A', 'RPS19', 'TNFSF13B', 'MT-ND2', 'SRA1', 'SRCAP', 'POLR3GL', 'NDUFB4', 'NFIC', 'NEUROD2', 'NEDD9', 'ZNF263', 'NEDD1', 'ZNF318', 'NDUFC1', 'ZNF326', 'UBE2Q2', 'NDUFA8', 'ZNF33B', 'NCOA5', 'NCOA1', 'NCLN', 'NCKAP1', 'NCK1', 'NINJ1', 'NLK', 'NMD3', 'NME1-NME2', 'NUP93', 'ZBTB7A', 'ZFC3H1', 'NSD1', 'TAF13', 'ZFP36', 'ZHX1', 'NRG4', 'ZMIZ1', 'NQO1', 'NPLOC4', 'NOP10', 'NOM1', 'ZNF202', 'NOL4L', 'NCDN', 'NCBP3', 'NCAM1', 'MT1X', 'MT-TY', 'MT-TV', 'MT-TS2', 'MT-TP', 'MT-TN', 'ZNF702P', 'SYNJ2', 'MT-TM', 'MT-TL1', 'MT-TE', 'ZNF703', 'MT-TD', 'ZNF764', 'MT-ND5', 'ZNRF1', 'MT1E', 'MTA2', 'NBEAL2', 'ZNF480', 'NAXD', 'NACC1', 'NACA4P', 'NAA10', 'MYO5C', 'ZNF354A', 'MYO10', 'MYH14', 'ZNF418', 'MXRA5', 'MXI1', 'SULF2', 'MTND2P28', 'SYNE2', 'MTND1P23', 'NUPR2', 'OPTN', 'OTUD7B', 'PLCD3', 'PLCB4', 'PLBD2', 'USP35', 'UTP18', 'PLAU', 'UTP3', 'STARD10', 'PKIB', 'PITX1', 'PITPNA', 'PICALM', 'PI4KB', 'PHRF1', 'PHLDA2', 'PHF20L1', 'USP32', 'PLCE1', 'PHACTR1', 'PLD1', 'SREK1IP1P1', 'SRFBP1', 'SRM', 'UIMC1', 'POLE4', 'POLDIP2', 'UPK1B', 'POLB', 'TAF15', 'UQCC2', 'PMEPA1', 'SSX2IP', 'UQCR11', 'UQCRQ', 'PLIN2', 'STRBP', 'VEGFB', 'OVOL1', 'PCYT1A', 'YTHDF1', 'PCDHGA10', 'PCDHB1', 'YWHAB', 'ZBED2', 'ZBED4', 'PATL1', 'PAQR8', 'PAQR7', 'PAPOLA', 'PAK2', 'PACS1', 'ZBTB2', 'P4HA2', 'ZBTB34', 'YKT6', 'PDAP1', 'PFDN4', 'PDCD4', 'VIT', 'PGAM5', 'VMP1', 'PGAM1', 'VPS45', 'VPS9D1-AS1', 'VRK3', 'SLC9A3R1', 'WDR77', 'PDS5A', 'WSB2', 'PDLIM1', 'STRIP1', 'WWC3', 'XBP1', 'TFRC', 'CDC6', 'ATP9A', 'CD9', 'ATP5MK', 'ATP5MG', 'CDC25B', 'ATP5ME', 'UBC', 'UGDH', 'ATAD2', 'ASB2', 'CDK2AP2', 'TYSND1', 'ARRDC3', 'ARPC1B', 'CEACAM5', 'VCPIP1', 'VEGFA', 'WTAPP1', 'ALDH1A3', 'ZNF473', 'ADAP1', 'ACLY', 'UBB', 'TXNIP', 'CANX', 'TRIM29', 'THBS1', 'CALB1', 'TMEM45A', 'TMSB4X', 'C1orf116', 'C10orf55', 'BUB1B', 'BUB1', 'TPT1', 'TRIM16', 'TSPO', 'TUBG1', 'BIRC5', 'BEND7', 'TST', 'TUBB', 'CCM2', 'TUBB4B', 'CENPF', 'CCNB1', 'CCNB2', 'BAG3', 'TUBD1', 'RPS15A', 'SYTL2', 'ID3', 'HNRNPM', 'HNRNPU', 'HPCAL1', 'HRH1', 'PLK2', 'HSPB1', 'IER2', 'HLA-A', 'IFITM2', 'PLAC8', 'PHC1', 'IRF2BP2', 'IRF6', 'ISG15', 'HMGCS1', 'HK2', 'PFKP', 'H2AX', 'GPX2', 'GSTP1', 'H1-0', 'H1-1', 'H1-3', 'PRNP', 'PRC1', 'POLR2A', 'PPP1R3G', 'HBP1', 'HERPUD1', 'PPIF', 'POLR3A', 'POLR2L', 'ITGA6', 'PFKFB4', 'PRSS8', 'MUL1', 'LMNA', 'NECTIN2', 'NDUFB2', 'LY6D', 'MCM7', 'MYC', 'MIF-AS1', 'NR4A1', 'MTATP6P1', 'MOB3A', 'MOV10', 'MRNIP', 'MRPS2', 'MSMO1', 'NOLC1', 'LBH', 'PERP', 'PCDH1', 'JUNB', 'JUP', 'PDK1', 'PDIA3', 'PCNA', 'KDM5B', 'KIF23', 'LAMB3', 'KIF2C', 'PABPC1', 'KNSTRN', 'NUP188', 'NT5C', 'NRP1', 'PRSS23', 'PSAP', 'CHAC1', 'DHRS3', 'DCBLD2', 'SLCO4A1', 'DDX5', 'SLC3A2', 'SLC39A6', 'SLC38A2', 'SLC20A1', 'CYB561A3', 'DNMT1', 'SENP3', 'SEMA4B', 'SCD', 'EBP', 'EEF1A1', 'CYP1B1-AS1', 'CXCL1', 'S100A6', 'SPP1', 'CKAP2', 'SRXN1', 'CLDN7', 'SRSF8', 'SQSTM1', 'SQLE', 'COX7A2', 'SNRNP25', 'COX7C', 'CPEB4', 'SPAG5', 'CRABP2', 'SNX33', 'SNRPD2', 'EEF2', 'S100A2', 'PSME2', 'FSCN1', 'FDPS', 'FGD5-AS1', 'FGFBP1', 'RALGDS', 'FN1', 'RAC1', 'FTH1', 'ROMO1', 'FYB1', 'QSOX1', 'PYCR3', 'PUSL1', 'PTMS', 'GFRA1', 'FASN', 'RPL10', 'EIF4A2', 'RPS12', 'RPSA', 'RPS8', 'RPS3', 'RPS2', 'RPS16', 'RPS15', 'EZR', 'RPL11', 'F3', 'RPL8', 'RPL37', 'RPL35A', 'FAM13A', 'RPL21', 'ZWINT']

Top three selected genes:

  • PGK1: Encodes Phosphoglycerate Kinase 1, a key metabolic enzyme.
    • Under hypoxic conditions, cells shift from oxidative phosphorylation to anaerobic glycolysis for energy production. PGK1 plays a key role in the anaerobic pathway, allowing cells to produce ATP even in the absence of oxygen.
  • MT-CYB: Encodes Cytochrome B, a protein involved in mitochondria.
    • Component of the mitochondrial electron transport chain (ETC), whose efficiency depends on oxygen. Under hypoxia, ETC is disrupted, leading to mitochondrial stress, altered expression, and possibly reduced oxidative phosphorylation.
  • MT-CO3: Encodes Cytochrome C Oxidase 3, a protein involved in mitochondria.
    • Involved in the final step of ETC, reducing oxygen to water. This gene is directly affected by oxygen, or the lack thereof.

These are just a few examples of the biological background of the selected genes. Based on the function behind each gene, it is clear why they have been selected by the models.

Top principal components¶

Since PCA depends on the data set, feature selection for the principal components is done by data set across all the models.

In [ ]:
ss_mcf7_top_pcs_occurrences = count_and_sort_occurrences([
    ss_mcf7_pca_logit_pcs,
    ss_mcf7_pca_svm_pcs,
    ss_mcf7_pca_random_forest_pcs
], False)
ss_mcf7_top_pcs = {}

for i in range(3, 0, -1):
    ss_mcf7_top_pcs[i] = ss_mcf7_top_pcs.get(i + 1, []) + filter_by_occurrences(ss_mcf7_top_pcs_occurrences, i)

for i in range(3, 0, -1):
    if len(ss_mcf7_top_pcs[i]) == 0:
        continue
    print(f"{len(ss_mcf7_top_pcs[i])} PC(s) selected for SmartSeq MCF7 across {i}+ model(s):")
    print(ss_mcf7_top_pcs[i])
    print()
3 PC(s) selected for SmartSeq MCF7 across 3+ model(s):
[6, 3, 1]

8 PC(s) selected for SmartSeq MCF7 across 2+ model(s):
[6, 3, 1, 18, 17, 16, 12, 8]

13 PC(s) selected for SmartSeq MCF7 across 1+ model(s):
[6, 3, 1, 18, 17, 16, 12, 8, 15, 9, 5, 4, 2]

In [ ]:
ss_hcc_top_pcs_occurrences = count_and_sort_occurrences([
    ss_hcc_pca_logit_pcs,
    ss_hcc_pca_svm_pcs,
    ss_hcc_pca_random_forest_pcs
], False)
ss_hcc_top_pcs = {}

for i in range(3, 0, -1):
    ss_hcc_top_pcs[i] = ss_hcc_top_pcs.get(i + 1, []) + filter_by_occurrences(ss_hcc_top_pcs_occurrences, i)

for i in range(3, 0, -1):
    if len(ss_hcc_top_pcs[i]) == 0:
        continue
    print(f"{len(ss_hcc_top_pcs[i])} PC(s) selected for SmartSeq HCC across {i}+ models(s):")
    print(ss_hcc_top_pcs[i])
    print()
2 PC(s) selected for SmartSeq HCC across 3+ models(s):
[3, 2]

7 PC(s) selected for SmartSeq HCC across 2+ models(s):
[3, 2, 26, 17, 12, 10, 9]

15 PC(s) selected for SmartSeq HCC across 1+ models(s):
[3, 2, 26, 17, 12, 10, 9, 32, 30, 23, 21, 16, 15, 13, 4]

In [ ]:
ds_mcf7_top_pcs_occurrences = count_and_sort_occurrences([
    ds_mcf7_pca_logit_pcs,
    ds_mcf7_pca_svm_pcs,
    ds_mcf7_pca_random_forest_pcs
], False)
ds_mcf7_top_pcs = {}

for i in range(3, 0, -1):
    ds_mcf7_top_pcs[i] = ds_mcf7_top_pcs.get(i + 1, []) + filter_by_occurrences(ds_mcf7_top_pcs_occurrences, i)

for i in range(3, 0, -1):
    if len(ds_mcf7_top_pcs[i]) == 0:
        continue
    print(f"{len(ds_mcf7_top_pcs[i])} PC(s) selected for DropSeq MCF7 across {i}+ models(s):")
    print(ds_mcf7_top_pcs[i])
    print()
42 PC(s) selected for DropSeq MCF7 across 3+ models(s):
[1, 149, 82, 95, 110, 116, 121, 140, 142, 145, 147, 157, 55, 170, 232, 442, 317, 318, 322, 344, 380, 383, 60, 167, 26, 16, 36, 25, 32, 8, 15, 33, 17, 6, 37, 27, 28, 3, 18, 2, 19, 5]

278 PC(s) selected for DropSeq MCF7 across 2+ models(s):
[1, 149, 82, 95, 110, 116, 121, 140, 142, 145, 147, 157, 55, 170, 232, 442, 317, 318, 322, 344, 380, 383, 60, 167, 26, 16, 36, 25, 32, 8, 15, 33, 17, 6, 37, 27, 28, 3, 18, 2, 19, 5, 267, 43, 271, 273, 275, 264, 252, 263, 254, 281, 249, 247, 239, 236, 235, 234, 231, 230, 219, 218, 213, 212, 279, 302, 282, 286, 385, 381, 377, 375, 371, 370, 364, 361, 353, 352, 350, 348, 342, 341, 339, 756, 332, 327, 323, 320, 319, 312, 305, 206, 293, 291, 287, 211, 200, 205, 31, 114, 112, 107, 105, 104, 100, 99, 94, 92, 91, 88, 87, 85, 81, 74, 71, 69, 66, 65, 62, 61, 57, 56, 40, 52, 46, 45, 115, 118, 204, 119, 203, 201, 44, 198, 195, 193, 191, 190, 188, 177, 175, 173, 172, 389, 161, 160, 153, 146, 141, 29, 138, 135, 133, 128, 127, 30, 120, 387, 758, 391, 543, 552, 555, 556, 557, 564, 565, 576, 580, 582, 585, 591, 596, 597, 598, 599, 602, 606, 546, 541, 486, 540, 491, 494, 495, 496, 497, 499, 504, 507, 508, 510, 512, 517, 520, 522, 392, 534, 538, 610, 612, 615, 621, 696, 698, 700, 702, 718, 722, 724, 726, 732, 733, 734, 741, 742, 746, 751, 754, 755, 682, 681, 677, 647, 623, 626, 631, 632, 633, 642, 646, 650, 675, 652, 653, 655, 658, 661, 672, 674, 487, 527, 485, 411, 406, 466, 464, 462, 461, 460, 429, 459, 409, 455, 449, 403, 415, 438, 418, 419, 436, 435, 434, 433, 484, 431, 469, 408, 402, 393, 475, 481, 400, 399, 401, 398, 471, 470]

372 PC(s) selected for DropSeq MCF7 across 1+ models(s):
[1, 149, 82, 95, 110, 116, 121, 140, 142, 145, 147, 157, 55, 170, 232, 442, 317, 318, 322, 344, 380, 383, 60, 167, 26, 16, 36, 25, 32, 8, 15, 33, 17, 6, 37, 27, 28, 3, 18, 2, 19, 5, 267, 43, 271, 273, 275, 264, 252, 263, 254, 281, 249, 247, 239, 236, 235, 234, 231, 230, 219, 218, 213, 212, 279, 302, 282, 286, 385, 381, 377, 375, 371, 370, 364, 361, 353, 352, 350, 348, 342, 341, 339, 756, 332, 327, 323, 320, 319, 312, 305, 206, 293, 291, 287, 211, 200, 205, 31, 114, 112, 107, 105, 104, 100, 99, 94, 92, 91, 88, 87, 85, 81, 74, 71, 69, 66, 65, 62, 61, 57, 56, 40, 52, 46, 45, 115, 118, 204, 119, 203, 201, 44, 198, 195, 193, 191, 190, 188, 177, 175, 173, 172, 389, 161, 160, 153, 146, 141, 29, 138, 135, 133, 128, 127, 30, 120, 387, 758, 391, 543, 552, 555, 556, 557, 564, 565, 576, 580, 582, 585, 591, 596, 597, 598, 599, 602, 606, 546, 541, 486, 540, 491, 494, 495, 496, 497, 499, 504, 507, 508, 510, 512, 517, 520, 522, 392, 534, 538, 610, 612, 615, 621, 696, 698, 700, 702, 718, 722, 724, 726, 732, 733, 734, 741, 742, 746, 751, 754, 755, 682, 681, 677, 647, 623, 626, 631, 632, 633, 642, 646, 650, 675, 652, 653, 655, 658, 661, 672, 674, 487, 527, 485, 411, 406, 466, 464, 462, 461, 460, 429, 459, 409, 455, 449, 403, 415, 438, 418, 419, 436, 435, 434, 433, 484, 431, 469, 408, 402, 393, 475, 481, 400, 399, 401, 398, 471, 470, 424, 9, 745, 54, 337, 7, 427, 666, 420, 59, 4, 47, 422, 426, 667, 48, 13, 12, 729, 23, 367, 21, 365, 20, 376, 736, 379, 711, 705, 35, 362, 743, 38, 14, 358, 355, 687, 685, 335, 97, 331, 515, 467, 269, 265, 181, 182, 186, 260, 519, 518, 197, 258, 514, 329, 257, 255, 474, 253, 506, 483, 245, 221, 243, 241, 240, 278, 151, 458, 571, 644, 328, 430, 326, 627, 96, 624, 321, 437, 314, 603, 307, 446, 303, 594, 592, 301, 300, 297, 579, 577, 144, 457, 338]

In [ ]:
ds_hcc_top_pcs_occurrences = count_and_sort_occurrences([
    ds_hcc_pca_logit_pcs,
    ds_hcc_pca_svm_pcs,
    ds_hcc_pca_random_forest_pcs,
], False)
ds_hcc_top_pcs = {}

for i in range(3, 0, -1):
    ds_hcc_top_pcs[i] = ds_hcc_top_pcs.get(i + 1, []) + filter_by_occurrences(ds_hcc_top_pcs_occurrences, i)

for i in range(3, 0, -1):
    if len(ds_hcc_top_pcs[i]) == 0:
        continue
    print(f"{len(ds_hcc_top_pcs[i])} PC(s) selected for DropSeq HCC across {i}+ models(s):")
    print(ds_hcc_top_pcs[i])
    print()
69 PC(s) selected for DropSeq HCC across 3+ models(s):
[187, 190, 201, 301, 198, 197, 99, 26, 24, 23, 21, 30, 189, 20, 19, 18, 102, 184, 182, 63, 208, 31, 176, 47, 259, 41, 39, 46, 38, 240, 262, 37, 48, 290, 90, 49, 270, 53, 36, 54, 34, 76, 15, 45, 11, 141, 147, 127, 8, 69, 155, 6, 142, 106, 5, 103, 161, 145, 65, 167, 12, 4, 169, 170, 172, 3, 2, 72, 174]

298 PC(s) selected for DropSeq HCC across 2+ models(s):
[187, 190, 201, 301, 198, 197, 99, 26, 24, 23, 21, 30, 189, 20, 19, 18, 102, 184, 182, 63, 208, 31, 176, 47, 259, 41, 39, 46, 38, 240, 262, 37, 48, 290, 90, 49, 270, 53, 36, 54, 34, 76, 15, 45, 11, 141, 147, 127, 8, 69, 155, 6, 142, 106, 5, 103, 161, 145, 65, 167, 12, 4, 169, 170, 172, 3, 2, 72, 174, 77, 351, 350, 356, 266, 319, 265, 357, 358, 263, 345, 261, 359, 260, 360, 80, 272, 294, 275, 282, 315, 313, 312, 325, 310, 255, 326, 329, 308, 842, 332, 334, 303, 336, 300, 297, 295, 318, 292, 339, 341, 843, 219, 254, 153, 183, 181, 180, 179, 178, 177, 175, 171, 166, 162, 157, 154, 148, 193, 143, 140, 139, 137, 136, 135, 115, 117, 131, 126, 118, 124, 191, 96, 88, 227, 249, 247, 89, 243, 239, 238, 237, 235, 234, 231, 230, 229, 225, 200, 224, 221, 220, 369, 218, 217, 215, 213, 92, 210, 94, 205, 362, 391, 371, 672, 640, 641, 642, 27, 651, 657, 691, 577, 692, 693, 698, 700, 701, 16, 637, 29, 630, 622, 613, 612, 610, 609, 603, 32, 600, 598, 594, 372, 584, 583, 582, 713, 717, 718, 785, 835, 831, 7, 814, 813, 810, 809, 808, 807, 799, 794, 793, 792, 789, 784, 730, 783, 781, 10, 777, 769, 766, 763, 761, 760, 758, 756, 749, 747, 746, 581, 586, 576, 413, 450, 448, 575, 447, 445, 444, 55, 439, 438, 60, 427, 420, 415, 414, 412, 453, 409, 408, 401, 399, 396, 120, 384, 383, 380, 379, 377, 375, 374, 373, 451, 123, 509, 563, 457, 562, 525, 551, 521, 516, 515, 540, 550, 507, 561, 503, 564, 566, 494, 490, 567, 479, 474, 461, 528]

400 PC(s) selected for DropSeq HCC across 1+ models(s):
[187, 190, 201, 301, 198, 197, 99, 26, 24, 23, 21, 30, 189, 20, 19, 18, 102, 184, 182, 63, 208, 31, 176, 47, 259, 41, 39, 46, 38, 240, 262, 37, 48, 290, 90, 49, 270, 53, 36, 54, 34, 76, 15, 45, 11, 141, 147, 127, 8, 69, 155, 6, 142, 106, 5, 103, 161, 145, 65, 167, 12, 4, 169, 170, 172, 3, 2, 72, 174, 77, 351, 350, 356, 266, 319, 265, 357, 358, 263, 345, 261, 359, 260, 360, 80, 272, 294, 275, 282, 315, 313, 312, 325, 310, 255, 326, 329, 308, 842, 332, 334, 303, 336, 300, 297, 295, 318, 292, 339, 341, 843, 219, 254, 153, 183, 181, 180, 179, 178, 177, 175, 171, 166, 162, 157, 154, 148, 193, 143, 140, 139, 137, 136, 135, 115, 117, 131, 126, 118, 124, 191, 96, 88, 227, 249, 247, 89, 243, 239, 238, 237, 235, 234, 231, 230, 229, 225, 200, 224, 221, 220, 369, 218, 217, 215, 213, 92, 210, 94, 205, 362, 391, 371, 672, 640, 641, 642, 27, 651, 657, 691, 577, 692, 693, 698, 700, 701, 16, 637, 29, 630, 622, 613, 612, 610, 609, 603, 32, 600, 598, 594, 372, 584, 583, 582, 713, 717, 718, 785, 835, 831, 7, 814, 813, 810, 809, 808, 807, 799, 794, 793, 792, 789, 784, 730, 783, 781, 10, 777, 769, 766, 763, 761, 760, 758, 756, 749, 747, 746, 581, 586, 576, 413, 450, 448, 575, 447, 445, 444, 55, 439, 438, 60, 427, 420, 415, 414, 412, 453, 409, 408, 401, 399, 396, 120, 384, 383, 380, 379, 377, 375, 374, 373, 451, 123, 509, 563, 457, 562, 525, 551, 521, 516, 515, 540, 550, 507, 561, 503, 564, 566, 494, 490, 567, 479, 474, 461, 528, 17, 70, 9, 35, 73, 74, 42, 44, 68, 86, 110, 109, 95, 75, 97, 56, 100, 13, 78, 14, 305, 121, 510, 656, 653, 646, 644, 632, 608, 604, 601, 599, 597, 593, 565, 553, 543, 541, 539, 534, 667, 674, 685, 812, 841, 839, 832, 829, 828, 820, 815, 778, 689, 771, 755, 733, 716, 714, 705, 699, 523, 499, 125, 497, 250, 245, 233, 212, 207, 202, 199, 196, 195, 192, 185, 159, 156, 152, 151, 134, 133, 253, 257, 267, 403, 488, 466, 452, 443, 434, 429, 423, 402, 269, 393, 382, 344, 338, 323, 317, 283, 1]

Models trained on selected genes¶

To generalize the models and make them more robust, new ones can be trained on various subsets of the selected genes. The hope is to produce models which train faster while still attaining equal or greater accuracy. This can even lead to a generalized model which is agnostic with respect to the data set.

In [ ]:
top_genes_ss_mcf7 = {i: list(filter(lambda x: x in X_ss_mcf7.columns, top_genes[i])) for i in range (1, 13)}
X_top_genes_ss_mcf7 = {
    i: X_ss_mcf7.loc[:, top_genes_ss_mcf7[i]] for i in range(1, 13)
}

top_genes_ss_hcc = {i: list(filter(lambda x: x in X_ss_hcc.columns, top_genes[i])) for i in range (1, 13)}
X_top_genes_ss_hcc = {
    i: X_ss_hcc.loc[:, top_genes_ss_hcc[i]] for i in range(1, 13)
}

top_genes_ds_mcf7 = {i: list(filter(lambda x: x in X_ds_mcf7.columns, top_genes[i])) for i in range (1, 13)}
X_top_genes_ds_mcf7 = {
    i: X_ds_mcf7.loc[:, top_genes_ds_mcf7[i]] for i in range(1, 13)
}

top_genes_ds_hcc = {i: list(filter(lambda x: x in X_ds_hcc.columns, top_genes[i])) for i in range (1, 13)}
X_top_genes_ds_hcc = {
    i: X_ds_hcc.loc[:, top_genes_ds_hcc[i]] for i in range(1, 13)
}
In [ ]:
for i, genes in top_genes_ss_mcf7.items():
    print(f"{len(genes)} genes selected {i}+ times for SmartSeq MCF7")
print()

for i, genes in top_genes_ss_hcc.items():
    print(f"{len(genes)} genes selected {i}+ times for SmartSeq HCC")
print()

for i, genes in top_genes_ds_mcf7.items():
    print(f"{len(genes)} genes selected {i}+ times for DropSeq MCF7")
print()

for i, genes in top_genes_ds_hcc.items():
    print(f"{len(genes)} genes selected {i}+ times for DropSeq HCC")
print()
302 genes selected 1+ times for SmartSeq MCF7
195 genes selected 2+ times for SmartSeq MCF7
76 genes selected 3+ times for SmartSeq MCF7
47 genes selected 4+ times for SmartSeq MCF7
30 genes selected 5+ times for SmartSeq MCF7
17 genes selected 6+ times for SmartSeq MCF7
9 genes selected 7+ times for SmartSeq MCF7
6 genes selected 8+ times for SmartSeq MCF7
3 genes selected 9+ times for SmartSeq MCF7
2 genes selected 10+ times for SmartSeq MCF7
1 genes selected 11+ times for SmartSeq MCF7
1 genes selected 12+ times for SmartSeq MCF7

279 genes selected 1+ times for SmartSeq HCC
160 genes selected 2+ times for SmartSeq HCC
63 genes selected 3+ times for SmartSeq HCC
40 genes selected 4+ times for SmartSeq HCC
26 genes selected 5+ times for SmartSeq HCC
15 genes selected 6+ times for SmartSeq HCC
8 genes selected 7+ times for SmartSeq HCC
7 genes selected 8+ times for SmartSeq HCC
3 genes selected 9+ times for SmartSeq HCC
2 genes selected 10+ times for SmartSeq HCC
1 genes selected 11+ times for SmartSeq HCC
1 genes selected 12+ times for SmartSeq HCC

554 genes selected 1+ times for DropSeq MCF7
507 genes selected 2+ times for DropSeq MCF7
112 genes selected 3+ times for DropSeq MCF7
74 genes selected 4+ times for DropSeq MCF7
31 genes selected 5+ times for DropSeq MCF7
13 genes selected 6+ times for DropSeq MCF7
8 genes selected 7+ times for DropSeq MCF7
6 genes selected 8+ times for DropSeq MCF7
3 genes selected 9+ times for DropSeq MCF7
2 genes selected 10+ times for DropSeq MCF7
1 genes selected 11+ times for DropSeq MCF7
1 genes selected 12+ times for DropSeq MCF7

578 genes selected 1+ times for DropSeq HCC
507 genes selected 2+ times for DropSeq HCC
136 genes selected 3+ times for DropSeq HCC
94 genes selected 4+ times for DropSeq HCC
44 genes selected 5+ times for DropSeq HCC
20 genes selected 6+ times for DropSeq HCC
10 genes selected 7+ times for DropSeq HCC
7 genes selected 8+ times for DropSeq HCC
3 genes selected 9+ times for DropSeq HCC
2 genes selected 10+ times for DropSeq HCC
1 genes selected 11+ times for DropSeq HCC
1 genes selected 12+ times for DropSeq HCC

Logistic regression¶

In [ ]:
ss_mcf7_top_genes_logit: dict[int, TrainedModelWrapper] = {}

ss_hcc_top_genes_logit: dict[int, TrainedModelWrapper] = {}

ds_mcf7_top_genes_logit: dict[int, TrainedModelWrapper] = {}

ds_hcc_top_genes_logit: dict[int, TrainedModelWrapper] = {}

SmartSeq MCF7

In [ ]:
for i in [12, 6, 1]:
    ss_mcf7_top_genes_logit[i] = train_test_logistic_regression(X_top_genes_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_mcf7_top_genes_logit[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9523809523809523
Genes selected across 6+ model(s). Accuracy: 1.0
Genes selected across 1+ model(s). Accuracy: 1.0

SmartSeq HCC

In [ ]:
for i in [12, 6, 1]:
    ss_hcc_top_genes_logit[i] = train_test_logistic_regression(X_top_genes_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_hcc_top_genes_logit[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9782608695652174
Genes selected across 6+ model(s). Accuracy: 0.9782608695652174
Genes selected across 1+ model(s). Accuracy: 0.9782608695652174

DropSeq MCF7

In [ ]:
for i in [12, 6, 1]:
    ds_mcf7_top_genes_logit[i] = train_test_logistic_regression(X_top_genes_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_mcf7_top_genes_logit[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.726465692620677
Genes selected across 6+ model(s). Accuracy: 0.8823746994636582
Genes selected across 1+ model(s). Accuracy: 0.9774366561864251

DropSeq HCC

In [ ]:
for i in [12, 6, 1]:
    ds_hcc_top_genes_logit[i] = train_test_logistic_regression(X_top_genes_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_hcc_top_genes_logit[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.6061018795968401
Genes selected across 6+ model(s). Accuracy: 0.8136747480250613
Genes selected across 1+ model(s). Accuracy: 0.9476981748842277

SVM¶

In [ ]:
ss_mcf7_top_genes_svm: dict[int, TrainedModelWrapper] = {}

ss_hcc_top_genes_svm: dict[int, TrainedModelWrapper] = {}

ds_mcf7_top_genes_svm: dict[int, TrainedModelWrapper] = {}

ds_hcc_top_genes_svm: dict[int, TrainedModelWrapper] = {}

SmartSeq MCF7

In [ ]:
for i in [12, 6, 1]:
    ss_mcf7_top_genes_svm[i] = train_test_svm(X_top_genes_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_mcf7_top_genes_svm[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9365079365079365
Genes selected across 6+ model(s). Accuracy: 1.0
Genes selected across 1+ model(s). Accuracy: 1.0

SmartSeq HCC

In [ ]:
for i in [12, 6, 1]:
    ss_hcc_top_genes_svm[i] = train_test_svm(X_top_genes_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_hcc_top_genes_svm[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9782608695652174
Genes selected across 6+ model(s). Accuracy: 0.9565217391304348
Genes selected across 1+ model(s). Accuracy: 0.9782608695652174

DropSeq MCF7

In [ ]:
for i in [12, 6, 1]:
    ds_mcf7_top_genes_svm[i] = train_test_svm(X_top_genes_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_mcf7_top_genes_svm[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.726465692620677
Genes selected across 6+ model(s). Accuracy: 0.8853338265211762
Genes selected across 1+ model(s). Accuracy: 0.9772517107453301

DropSeq HCC

In [ ]:
for i in [12, 6, 1]:
    ds_hcc_top_genes_svm[i] = train_test_svm(X_top_genes_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_hcc_top_genes_svm[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.6061018795968401
Genes selected across 6+ model(s). Accuracy: 0.8136747480250613
Genes selected across 1+ model(s). Accuracy: 0.9506946336148189

Random forest¶

In [ ]:
ss_mcf7_top_genes_random_forest: dict[int, TrainedModelWrapper] = {}

ss_hcc_top_genes_random_forest: dict[int, TrainedModelWrapper] = {}

ds_mcf7_top_genes_random_forest: dict[int, TrainedModelWrapper] = {}

ds_hcc_top_genes_random_forest: dict[int, TrainedModelWrapper] = {}

SmartSeq MCF7

In [ ]:
for i in [12, 6, 1]:
    ss_mcf7_top_genes_random_forest[i] = train_test_random_forest(X_top_genes_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_mcf7_top_genes_random_forest[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9365079365079365
Genes selected across 6+ model(s). Accuracy: 1.0
Genes selected across 1+ model(s). Accuracy: 1.0

SmartSeq HCC

In [ ]:
for i in [12, 6, 1]:
    ss_hcc_top_genes_random_forest[i] = train_test_random_forest(X_top_genes_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_hcc_top_genes_random_forest[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9782608695652174
Genes selected across 6+ model(s). Accuracy: 0.9782608695652174
Genes selected across 1+ model(s). Accuracy: 0.9782608695652174

DropSeq MCF7

In [ ]:
for i in [12, 6, 1]:
    ds_mcf7_top_genes_random_forest[i] = train_test_random_forest(X_top_genes_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_mcf7_top_genes_random_forest[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.726465692620677
Genes selected across 6+ model(s). Accuracy: 0.884224153874607
Genes selected across 1+ model(s). Accuracy: 0.9726280747179582

DropSeq HCC

In [ ]:
for i in [12, 6, 1]:
    ds_hcc_top_genes_random_forest[i] = train_test_random_forest(X_top_genes_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_hcc_top_genes_random_forest[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.48597112503405065
Genes selected across 6+ model(s). Accuracy: 0.8193952601470988
Genes selected across 1+ model(s). Accuracy: 0.9406156360664669

Multilayer perceptron¶

In [ ]:
ss_mcf7_top_genes_mlp: dict[int, TrainedModelWrapper] = {}

ss_hcc_top_genes_mlp: dict[int, TrainedModelWrapper] = {}

ds_mcf7_top_genes_mlp: dict[int, TrainedModelWrapper] = {}

ds_hcc_top_genes_mlp: dict[int, TrainedModelWrapper] = {}

SmartSeq MCF7

In [ ]:
for i in [12, 6, 1]:
    ss_mcf7_top_genes_mlp[i] = train_test_mlp(X_top_genes_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_mcf7_top_genes_mlp[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.5079365079365079
Genes selected across 6+ model(s). Accuracy: 0.9206349206349206
Genes selected across 1+ model(s). Accuracy: 0.9682539682539683

SmartSeq HCC

In [ ]:
for i in [12, 6, 1]:
    ss_hcc_top_genes_mlp[i] = train_test_mlp(X_top_genes_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_hcc_top_genes_mlp[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.5434782608695652
Genes selected across 6+ model(s). Accuracy: 0.9130434782608695
Genes selected across 1+ model(s). Accuracy: 0.9347826086956522

DropSeq MCF7

In [ ]:
for i in [12, 6, 1]:
    ds_mcf7_top_genes_mlp[i] = train_test_mlp(X_top_genes_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_mcf7_top_genes_mlp[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.726465692620677
Genes selected across 6+ model(s). Accuracy: 0.8875531718143148
Genes selected across 1+ model(s). Accuracy: 0.980395783243943

DropSeq HCC

In [ ]:
for i in [12, 6, 1]:
    ds_hcc_top_genes_mlp[i] = train_test_mlp(X_top_genes_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_hcc_top_genes_mlp[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.6061018795968401
Genes selected across 6+ model(s). Accuracy: 0.8163988014165078
Genes selected across 1+ model(s). Accuracy: 0.9504222282756742

Models trained on selected PCs¶

In [ ]:
X_top_pcs_ss_mcf7 = {i: X_pca_ss_mcf7[:, ss_mcf7_top_pcs[i]] for i in range(1, 4)}
X_top_pcs_ss_hcc = {i: X_pca_ss_hcc[:, ss_hcc_top_pcs[i]] for i in range(1, 4)}
X_top_pcs_ds_mcf7 = {i: X_pca_ds_mcf7[:, ds_mcf7_top_pcs[i]] for i in range(1, 4)}
X_top_pcs_ds_hcc = {i: X_pca_ds_hcc[:, ds_hcc_top_pcs[i]] for i in range(1, 4)}

Logistic regression¶

In [ ]:
ss_mcf7_top_pcs_logit: dict[int, TrainedModelWrapper] = {}

ss_hcc_top_pcs_logit: dict[int, TrainedModelWrapper] = {}

ds_mcf7_top_pcs_logit: dict[int, TrainedModelWrapper] = {}

ds_hcc_top_pcs_logit: dict[int, TrainedModelWrapper] = {}

SmartSeq MCF7

In [ ]:
for i in range(3, 0, -1):
    ss_mcf7_top_pcs_logit[i] = train_test_logistic_regression(X_top_pcs_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ss_mcf7_top_pcs_logit[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.5079365079365079
PCs selected across 2+ model(s). Accuracy: 0.5238095238095238
PCs selected across 1+ model(s). Accuracy: 0.7619047619047619

SmartSeq HCC

In [ ]:
for i in range(3, 0, -1):
    ss_hcc_top_pcs_logit[i] = train_test_logistic_regression(X_top_pcs_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ss_hcc_top_pcs_logit[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9130434782608695
PCs selected across 2+ model(s). Accuracy: 0.9347826086956522
PCs selected across 1+ model(s). Accuracy: 0.8913043478260869

DropSeq MCF7

In [ ]:
for i in range(3, 0, -1):
    ds_mcf7_top_pcs_logit[i] = train_test_logistic_regression(X_top_pcs_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ds_mcf7_top_pcs_logit[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9258368781209543
PCs selected across 2+ model(s). Accuracy: 0.9315701867948956
PCs selected across 1+ model(s). Accuracy: 0.9408174588496394

DropSeq HCC

In [ ]:
for i in range(3, 0, -1):
    ds_hcc_top_pcs_logit[i] = train_test_logistic_regression(X_top_pcs_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ds_hcc_top_pcs_logit[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.8673385998365568
PCs selected across 2+ model(s). Accuracy: 0.8921274856987197
PCs selected across 1+ model(s). Accuracy: 0.9199128302914737

SVM¶

In [ ]:
ss_mcf7_top_pcs_svm: dict[int, TrainedModelWrapper] = {}

ss_hcc_top_pcs_svm: dict[int, TrainedModelWrapper] = {}

ds_mcf7_top_pcs_svm: dict[int, TrainedModelWrapper] = {}

ds_hcc_top_pcs_svm: dict[int, TrainedModelWrapper] = {}

SmartSeq MCF7

In [ ]:
for i in range(3, 0, -1):
    ss_mcf7_top_pcs_svm[i] = train_test_svm(X_top_pcs_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ss_mcf7_top_pcs_svm[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.49206349206349204
PCs selected across 2+ model(s). Accuracy: 0.47619047619047616
PCs selected across 1+ model(s). Accuracy: 0.746031746031746

SmartSeq HCC

In [ ]:
for i in range(3, 0, -1):
    ss_hcc_top_pcs_svm[i] = train_test_svm(X_top_pcs_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ss_hcc_top_pcs_svm[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9347826086956522
PCs selected across 2+ model(s). Accuracy: 0.9347826086956522
PCs selected across 1+ model(s). Accuracy: 0.8695652173913043

DropSeq MCF7

In [ ]:
for i in range(3, 0, -1):
    ds_mcf7_top_pcs_svm[i] = train_test_svm(X_top_pcs_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ds_mcf7_top_pcs_svm[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9238024782689107
PCs selected across 2+ model(s). Accuracy: 0.926391714444239
PCs selected across 1+ model(s). Accuracy: 0.9371185500277418

DropSeq HCC

In [ ]:
for i in range(3, 0, -1):
    ds_hcc_top_pcs_svm[i] = train_test_svm(X_top_pcs_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ds_hcc_top_pcs_svm[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.8662489784799782
PCs selected across 2+ model(s). Accuracy: 0.8902206483247072
PCs selected across 1+ model(s). Accuracy: 0.9212748569871969

Random forest¶

In [ ]:
ss_mcf7_top_pcs_random_forest: dict[int, TrainedModelWrapper] = {}

ss_hcc_top_pcs_random_forest: dict[int, TrainedModelWrapper] = {}

ds_mcf7_top_pcs_random_forest: dict[int, TrainedModelWrapper] = {}

ds_hcc_top_pcs_random_forest: dict[int, TrainedModelWrapper] = {}

SmartSeq MCF7

In [ ]:
for i in range(3, 0, -1):
    ss_mcf7_top_pcs_random_forest[i] = train_test_random_forest(X_top_pcs_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ss_mcf7_top_pcs_random_forest[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9206349206349206
PCs selected across 2+ model(s). Accuracy: 0.9365079365079365
PCs selected across 1+ model(s). Accuracy: 0.9841269841269841

SmartSeq HCC

In [ ]:
for i in range(3, 0, -1):
    ss_hcc_top_pcs_random_forest[i] = train_test_random_forest(X_top_pcs_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ss_hcc_top_pcs_random_forest[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9347826086956522
PCs selected across 2+ model(s). Accuracy: 0.9347826086956522
PCs selected across 1+ model(s). Accuracy: 0.9347826086956522

DropSeq MCF7

In [ ]:
for i in range(3, 0, -1):
    ds_mcf7_top_pcs_random_forest[i] = train_test_random_forest(X_top_pcs_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ds_mcf7_top_pcs_random_forest[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9326798594414648
PCs selected across 2+ model(s). Accuracy: 0.9343443684113186
PCs selected across 1+ model(s). Accuracy: 0.9361938228222675

DropSeq HCC

In [ ]:
for i in range(3, 0, -1):
    ds_hcc_top_pcs_random_forest[i] = train_test_random_forest(X_top_pcs_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ds_hcc_top_pcs_random_forest[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.8703350585671479
PCs selected across 2+ model(s). Accuracy: 0.8714246799237265
PCs selected across 1+ model(s). Accuracy: 0.8855897575592482

Multilayer perceptron¶

In [ ]:
ss_mcf7_top_pcs_mlp: dict[int, TrainedModelWrapper] = {}

ss_hcc_top_pcs_mlp: dict[int, TrainedModelWrapper] = {}

ds_mcf7_top_pcs_mlp: dict[int, TrainedModelWrapper] = {}

ds_hcc_top_pcs_mlp: dict[int, TrainedModelWrapper] = {}

SmartSeq MCF7

In [ ]:
for i in range(3, 0, -1):
    ss_mcf7_top_pcs_mlp[i] = train_test_mlp(X_top_pcs_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ss_mcf7_top_pcs_mlp[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.7619047619047619
PCs selected across 2+ model(s). Accuracy: 0.8571428571428571
PCs selected across 1+ model(s). Accuracy: 0.8095238095238095

SmartSeq HCC

In [ ]:
for i in range(3, 0, -1):
    ss_hcc_top_pcs_mlp[i] = train_test_mlp(X_top_pcs_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ss_hcc_top_pcs_mlp[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9347826086956522
PCs selected across 2+ model(s). Accuracy: 0.9130434782608695
PCs selected across 1+ model(s). Accuracy: 0.9565217391304348

DropSeq MCF7

In [ ]:
for i in range(3, 0, -1):
    ds_mcf7_top_pcs_mlp[i] = train_test_mlp(X_top_pcs_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ds_mcf7_top_pcs_mlp[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9472905492879601
PCs selected across 2+ model(s). Accuracy: 0.953393748844091
PCs selected across 1+ model(s). Accuracy: 0.9628259663399297

DropSeq HCC

In [ ]:
for i in range(3, 0, -1):
    ds_hcc_top_pcs_mlp[i] = train_test_mlp(X_top_pcs_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
    print(f"PCs selected across {i}+ model(s). Accuracy: {ds_hcc_top_pcs_mlp[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.8943067284118769
PCs selected across 2+ model(s). Accuracy: 0.9114682647779897
PCs selected across 1+ model(s). Accuracy: 0.9229092890220648

Individual model comparison¶

In [ ]:
def compare_accuracies(logits: dict[int, TrainedModelWrapper], svms: dict[int, TrainedModelWrapper], random_forests: dict[int, TrainedModelWrapper], mlps: dict[int, TrainedModelWrapper], feature_type: str = "feature"):
    max_key = max(logits, key = lambda k: logits[k].accuracy)
    max_accuracy = logits[max_key].accuracy

    print(f"Logistic regression accuracy: {max_accuracy} with {feature_type}s selected across {max_key}+ models.")
    
    max_key = max(svms, key = lambda k: svms[k].accuracy)
    max_accuracy = svms[max_key].accuracy

    print(f"SVM accuracy: {max_accuracy} with {feature_type}s selected across {max_key}+ models.")
    
    max_key = max(random_forests, key = lambda k: random_forests[k].accuracy)
    max_accuracy = random_forests[max_key].accuracy

    print(f"Random forest accuracy: {max_accuracy} with {feature_type}s selected across {max_key}+ models.")
    
    max_key = max(mlps, key = lambda k: mlps[k].accuracy)
    max_accuracy = mlps[max_key].accuracy

    print(f"Multilayer perceptron accuracy: {max_accuracy} with {feature_type}s selected across {max_key}+ models.")

Models trained on selected genes¶

SmartSeq MCF7

In [ ]:
compare_accuracies(ss_mcf7_top_genes_logit, ss_mcf7_top_genes_svm, ss_mcf7_top_genes_random_forest, ss_mcf7_top_genes_mlp, "gene")
Logistic regression accuracy: 1.0 with genes selected across 6+ models.
SVM accuracy: 1.0 with genes selected across 6+ models.
Random forest accuracy: 1.0 with genes selected across 6+ models.
Multilayer perceptron accuracy: 0.9841269841269841 with genes selected across 1+ models.

SmartSeq HCC

In [ ]:
compare_accuracies(ss_hcc_top_genes_logit, ss_hcc_top_genes_svm, ss_hcc_top_genes_random_forest, ss_hcc_top_genes_mlp, "gene")
Logistic regression accuracy: 0.9782608695652174 with genes selected across 12+ models.
SVM accuracy: 0.9782608695652174 with genes selected across 12+ models.
Random forest accuracy: 0.9782608695652174 with genes selected across 6+ models.
Multilayer perceptron accuracy: 0.8695652173913043 with genes selected across 6+ models.

DropSeq MCF7

In [ ]:
compare_accuracies(ds_mcf7_top_genes_logit, ds_mcf7_top_genes_svm, ds_mcf7_top_genes_random_forest, ds_mcf7_top_genes_mlp, "gene")
Logistic regression accuracy: 0.9757721472165711 with genes selected across 1+ models.
SVM accuracy: 0.9750323654521916 with genes selected across 1+ models.
Random forest accuracy: 0.9726280747179582 with genes selected across 1+ models.
Multilayer perceptron accuracy: 0.9783613833918994 with genes selected across 1+ models.

DropSeq HCC

In [ ]:
compare_accuracies(ds_hcc_top_genes_logit, ds_hcc_top_genes_svm, ds_hcc_top_genes_random_forest, ds_mcf7_top_genes_mlp, "gene")
Logistic regression accuracy: 0.9512394442931081 with genes selected across 1+ models.
SVM accuracy: 0.9515118496322528 with genes selected across 1+ models.
Random forest accuracy: 0.9441569054753474 with genes selected across 1+ models.
Multilayer perceptron accuracy: 0.9783613833918994 with genes selected across 1+ models.

As expected, the multilayer perceptron does not predict well on the SmartSeq data, as the data set is too small. The three other models perform very well when trained on selected genes for SmartSeq, with no significant differences in accuracy.

On DropSeq, while all the model perform quite well, the multilayer perceptron has a notably higher accuracy for DropSeq HCC.

For the larger data sets, the highest accuracy scores came from training the model on the genes that had been selected at least once by a model, which makes sense since this provides a larger pool of features (all of which have already been selected and have stronger predictive power).

Models trained on PCA-encoded data¶

SmartSeq MCF7

In [ ]:
print("Logistic regression accuracy:", ss_mcf7_pca_logit.accuracy)
print("SVM accuracy:", ss_mcf7_pca_svm.accuracy)
print("Random forest accuracy:", ss_mcf7_pca_random_forest.accuracy)
print("Multilayer perceptron accuracy:", ss_mcf7_pca_mlp.accuracy)
Logistic regression accuracy: 1.0
SVM accuracy: 1.0
Random forest accuracy: 1.0
Multilayer perceptron accuracy: 0.9682539682539683

SmartSeq HCC

In [ ]:
print("Logistic regression accuracy:", ss_hcc_pca_logit.accuracy)
print("SVM accuracy:", ss_hcc_pca_svm.accuracy)
print("Random forest accuracy:", ss_hcc_pca_random_forest.accuracy)
print("Multilayer perceptron accuracy:", ss_hcc_pca_mlp.accuracy)
Logistic regression accuracy: 0.9782608695652174
SVM accuracy: 0.9782608695652174
Random forest accuracy: 0.9782608695652174
Multilayer perceptron accuracy: 0.8913043478260869

DropSeq MCF7

In [ ]:
print("Logistic regression accuracy:", ds_mcf7_pca_logit.accuracy)
print("SVM accuracy:", ds_mcf7_pca_svm.accuracy)
print("Random forest accuracy:", ds_mcf7_pca_random_forest.accuracy)
print("Multilayer perceptron accuracy:", ds_mcf7_pca_mlp.accuracy)
Logistic regression accuracy: 0.975957092657666
SVM accuracy: 0.9755872017754762
Random forest accuracy: 0.926206769003144
Multilayer perceptron accuracy: 0.9776216016275199

DropSeq HCC

In [ ]:
print("Logistic regression accuracy:", ds_hcc_pca_logit.accuracy)
print("SVM accuracy:", ds_hcc_pca_svm.accuracy)
print("Random forest accuracy:", ds_hcc_pca_random_forest.accuracy)
print("Multilayer perceptron accuracy:", ds_hcc_pca_mlp.accuracy)
Logistic regression accuracy: 0.9566875510760011
SVM accuracy: 0.954508308362844
Random forest accuracy: 0.8948515390901661
Multilayer perceptron accuracy: 0.9591391991283029

All data sets

In [ ]:
print("Logistic regression accuracy:", np.mean([ss_mcf7_pca_logit.accuracy, ss_hcc_pca_logit.accuracy, ds_mcf7_pca_logit.accuracy, ds_hcc_pca_logit.accuracy]))
print("SVM accuracy:", np.mean([ss_mcf7_pca_svm.accuracy, ss_hcc_pca_svm.accuracy, ds_mcf7_pca_svm.accuracy, ds_hcc_pca_svm.accuracy]))
print("Random forest accuracy:", np.mean([ss_mcf7_pca_random_forest.accuracy, ss_hcc_pca_random_forest.accuracy, ds_mcf7_pca_random_forest.accuracy, ds_hcc_pca_random_forest.accuracy]))
print("Multilayer perceptron accuracy:", np.mean([ss_mcf7_pca_mlp.accuracy, ss_hcc_pca_mlp.accuracy, ds_mcf7_pca_mlp.accuracy, ds_hcc_pca_mlp.accuracy]))
Logistic regression accuracy: 0.9777263783247212
SVM accuracy: 0.9770890949258844
Random forest accuracy: 0.9498297944146319
Multilayer perceptron accuracy: 0.9490797792089696

Logistic regression and SVM perform the best on SmartSeq HCC. Where as the multilayer perceptrons are best on both DropSeq data sets.

Models trained on PCA-encoded data, with feature selection¶

SmartSeq MCF7

In [ ]:
compare_accuracies(ss_mcf7_top_pcs_logit, ss_mcf7_top_pcs_svm, ss_mcf7_top_pcs_random_forest, ss_mcf7_top_pcs_mlp, "PC")
Logistic regression accuracy: 0.7936507936507936 with PCs selected across 2+ models.
SVM accuracy: 0.7619047619047619 with PCs selected across 2+ models.
Random forest accuracy: 0.9523809523809523 with PCs selected across 2+ models.
Multilayer perceptron accuracy: 0.8888888888888888 with PCs selected across 2+ models.

SmartSeq HCC

In [ ]:
compare_accuracies(ss_hcc_top_pcs_logit, ss_hcc_top_pcs_svm, ss_hcc_top_pcs_random_forest, ss_hcc_top_pcs_mlp, "PC")
Logistic regression accuracy: 0.9347826086956522 with PCs selected across 3+ models.
SVM accuracy: 0.9130434782608695 with PCs selected across 3+ models.
Random forest accuracy: 0.9130434782608695 with PCs selected across 2+ models.
Multilayer perceptron accuracy: 0.9130434782608695 with PCs selected across 3+ models.

DropSeq MCF7

In [ ]:
compare_accuracies(ds_mcf7_top_pcs_logit, ds_mcf7_top_pcs_svm, ds_mcf7_top_pcs_random_forest, ds_mcf7_top_pcs_mlp, "PC")
Logistic regression accuracy: 0.9347142592935084 with PCs selected across 1+ models.
SVM accuracy: 0.9295357869428519 with PCs selected across 1+ models.
Random forest accuracy: 0.9374884409099316 with PCs selected across 2+ models.
Multilayer perceptron accuracy: 0.9515442944331423 with PCs selected across 1+ models.

DropSeq HCC

In [ ]:
compare_accuracies(ds_hcc_top_pcs_logit, ds_hcc_top_pcs_svm, ds_hcc_top_pcs_random_forest, ds_hcc_top_pcs_mlp, "PC")
Logistic regression accuracy: 0.925905747752656 with PCs selected across 1+ models.
SVM accuracy: 0.925088531735222 with PCs selected across 1+ models.
Random forest accuracy: 0.892672296377009 with PCs selected across 1+ models.
Multilayer perceptron accuracy: 0.9299918278398257 with PCs selected across 1+ models.

While some of the accuracies are not bad, these models trained on selected principal components do not perform as well as their counterparts. This observation is logical as PCA components obfuscate the meaning of the original features, so feature selection doesn't have a meaningful impact.

Ensemble models¶

In search of creating a generalized model with higher accuracy, multiple approaches to hybridization of the individual models will be examined.

Simple majority vote¶

The simplest ensemble model simply takes the predictions from the provided models and takes the majority vote.

In [ ]:
class SimpleMajorityVoteClassifier():   
    def __init__(self, models: list[TrainedModelWrapper]):
        """
        models: list of pretrained classifiers with .predict() method
        """
        self.models = models
        
        test_sets = [(model.X_test, model.y_test) for model in self.models]
        self.assert_test_sets_equal(test_sets)
        
        self.X_test = models[0].X_test
        self.y_test = models[0].y_test
        
        self.features = self.X_test.columns
        self.accuracy = None
        
    def predict(self, X):
        if missing := set(self.features) - set(X.columns):
            raise ValueError(f"Missing columns:", missing)
        
        predictions = [model.predict(X) for model in self.models]
        predictions = np.vstack(predictions)

        predictions_T = predictions.T

        majority_votes = []

        for sample_preds in predictions_T:
            values, counts = np.unique(sample_preds, return_counts = True)
            best_label = values[np.argmax(counts)]
            majority_votes.append(best_label)

        return np.array(majority_votes)
    
    def assert_test_sets_equal(self, test_sets: list):
        ref_X, ref_y = test_sets[0]
        for i, (X, y) in enumerate(test_sets[1:], start=1):
            if not (np.array_equal(ref_X, X) and np.array_equal(ref_y, y)):
                raise ValueError(f"Test set {i} does not match the reference test set.")
    
    def test(self, X_test: Any | None = None, y_test: Any | None = None, verbose: bool = True):
        if not X_test or not y_test:
            X_test = self.X_test
            y_test = self.y_test
        
        self.accuracy = test_model(self, X_test, y_test, verbose)
        return self.accuracy

Since the multilayer perceptron does not work well with small data sets, it is omitted from the SmartSeq ensembles.

In [ ]:
ss_mcf7_simple_ensemble = SimpleMajorityVoteClassifier([
    ss_mcf7_top_genes_logit[1],
    ss_mcf7_top_genes_svm[1],
    ss_mcf7_top_genes_random_forest[1]
])
ss_mcf7_simple_ensemble.test()
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              31               0
Actual Norm               0              32
Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      1.00      1.00        31
        Norm       1.00      1.00      1.00        32

    accuracy                           1.00        63
   macro avg       1.00      1.00      1.00        63
weighted avg       1.00      1.00      1.00        63

1.0
In [ ]:
ss_hcc_simple_ensemble = SimpleMajorityVoteClassifier([
    ss_hcc_top_genes_logit[1],
    ss_hcc_top_genes_svm[1],
    ss_hcc_top_genes_random_forest[1]
])
ss_hcc_simple_ensemble.test()
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              24               1
Actual Norm               0              21
Accuracy: 0.9782608695652174
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      0.96      0.98        25
        Norm       0.95      1.00      0.98        21

    accuracy                           0.98        46
   macro avg       0.98      0.98      0.98        46
weighted avg       0.98      0.98      0.98        46

0.9782608695652174
In [ ]:
ds_mcf7_simple_ensemble = SimpleMajorityVoteClassifier([
    ds_mcf7_top_genes_logit[1],
    ds_mcf7_top_genes_svm[1],
    ds_mcf7_top_genes_random_forest[1],
    ds_mcf7_top_genes_mlp[1]
])
ds_mcf7_simple_ensemble.test()
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2164              66
Actual Norm              49            3128
Accuracy: 0.9787312742740891
Classification report:
               precision    recall  f1-score   support

        Hypo       0.98      0.97      0.97      2230
        Norm       0.98      0.98      0.98      3177

    accuracy                           0.98      5407
   macro avg       0.98      0.98      0.98      5407
weighted avg       0.98      0.98      0.98      5407

0.9787312742740891
In [ ]:
ds_hcc_simple_ensemble = SimpleMajorityVoteClassifier([
    ds_hcc_top_genes_logit[1],
    ds_hcc_top_genes_svm[1],
    ds_hcc_top_genes_random_forest[1],
    ds_hcc_top_genes_mlp[1]
])
ds_hcc_simple_ensemble.test()
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            2126              99
Actual Norm              77            1369
Accuracy: 0.952056660310542
Classification report:
               precision    recall  f1-score   support

        Hypo       0.97      0.96      0.96      2225
        Norm       0.93      0.95      0.94      1446

    accuracy                           0.95      3671
   macro avg       0.95      0.95      0.95      3671
weighted avg       0.95      0.95      0.95      3671

0.952056660310542

Weighted majority vote¶

A slightly improved version of the simple majority vote model is a weighted majority vote model, where the predictions are weighted by the individual model's test accuracy. In theory, this could help improve accuracy ensuring that higher quality votes are worth more.

However, in practice, to test this model would require a different subdivision of the original data set to have a training set for the individual models, a test set for the individual models to obtain their accuracies, and a second test set for the hybrid model to verify its effectiveness. Given the computational complexity of retraining all of the models on a different split of the data set, this model will not be tested but instead stands as a potential improvement.

In [ ]:
class WeightedMajorityVoteClassifier():   
    def __init__(self, models: list[TrainedModelWrapper]):
        """
        models: list of pretrained classifiers with .predict() method
        """
        self.models = models
        
        test_sets = [(model.X_test, model.y_test) for model in self.models]
        self.assert_test_sets_equal(test_sets)
        
        self.X_test = models[0].X_test
        self.y_test = models[0].y_test
        
        self.features = self.X_test.columns
        self.accuracy = None
        
    def predict(self, X):
        if missing := set(self.features) - set(X.columns):
            raise ValueError(f"Missing columns:", missing)
        
        predictions = [model.predict(X) for model in self.models]
        predictions = np.vstack(predictions)
        
        weights = np.array([model.accuracy for model in self.models])

        predictions_T = predictions.T

        weighted_votes = []

        for sample_preds in predictions_T:
            unique_labels = np.unique(sample_preds)
            label_weights = {label: 0.0 for label in unique_labels}

            for pred, weight in zip(sample_preds, weights):
                label_weights[pred] += weight

            best_label = max(label_weights, key = label_weights.get)
            weighted_votes.append(best_label)

        return np.array(weighted_votes)
    
    def assert_test_sets_equal(self, test_sets: list):
        ref_X, ref_y = test_sets[0]
        for i, (X, y) in enumerate(test_sets[1:], start=1):
            if not (np.array_equal(ref_X, X) and np.array_equal(ref_y, y)):
                raise ValueError(f"Test set {i} does not match the reference test set.")
    
    def test(self, X_test: Any | None = None, y_test: Any | None = None, verbose: bool = True):
        if not X_test or not y_test:
            X_test = self.X_test
            y_test = self.y_test
        
        self.accuracy = test_model(self, X_test, y_test, verbose)
        return self.accuracy

Generalized majority vote¶

The ensemble model can be generalized to be agnostic to the data set by comparing the features in the input and the selected features of the various models, then conditionally using certain models depending on the correct subset of features. This generalization over a subset of genes with high predictive power on different datasets produces a very robust model.

Again, due to the computational complexity and data restrictions of dividing the data set into more partitions, this model does not weigh the predictions. However, given sufficient computational power and data, an improved model could potentially be developed by ensembling the WeightedMajorityVoteClassifiers.

In [ ]:
class GeneralizedMajorityVoteClassifier():
    def __init__(self, models: list[SimpleMajorityVoteClassifier]):
        """
        models: list of pretrained classifiers with .predict() method
        """
        self.models = models
        
        self.test_sets = [(model.X_test, model.y_test) for model in self.models]
        self.features = [X.columns for X, _ in self.test_sets]
    
    def predict(self, X):
        input_features = set(X.columns)
        
        predictions = [model.predict(X.loc[:, model.features]) for model in self.models if not set(model.features) - input_features]
        predictions = np.vstack(predictions)

        predictions_T = predictions.T

        majority_votes = []

        for sample_preds in predictions_T:
            values, counts = np.unique(sample_preds, return_counts = True)
            best_label = values[np.argmax(counts)]
            majority_votes.append(best_label)

        return np.array(majority_votes)

    def test(self, X_test, y_test, verbose: bool = True):
        return test_model(self, X_test, y_test, verbose)
In [ ]:
generalized_classifier = GeneralizedMajorityVoteClassifier([ss_mcf7_simple_ensemble, ss_hcc_simple_ensemble, ds_mcf7_simple_ensemble, ds_hcc_simple_ensemble])
generalized_classifier_scores = {}
In [ ]:
generalized_classifier_scores["ss_mcf7"] = generalized_classifier.test(X_ss_mcf7, y_ss_mcf7)
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo             124               0
Actual Norm               0             126
Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      1.00      1.00       124
        Norm       1.00      1.00      1.00       126

    accuracy                           1.00       250
   macro avg       1.00      1.00      1.00       250
weighted avg       1.00      1.00      1.00       250

In [ ]:
generalized_classifier_scores["ss_hcc"] = generalized_classifier.test(X_ss_hcc, y_ss_hcc)
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              96               1
Actual Norm               0              85
Accuracy: 0.9945054945054945
Classification report:
               precision    recall  f1-score   support

        Hypo       1.00      0.99      0.99        97
        Norm       0.99      1.00      0.99        85

    accuracy                           0.99       182
   macro avg       0.99      0.99      0.99       182
weighted avg       0.99      0.99      0.99       182

In [ ]:
generalized_classifier_scores["ds_mcf7"] = generalized_classifier.test(X_ds_mcf7, y_ds_mcf7)
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            8806             115
Actual Norm             107           12598
Accuracy: 0.9897345787478036
Classification report:
               precision    recall  f1-score   support

        Hypo       0.99      0.99      0.99      8921
        Norm       0.99      0.99      0.99     12705

    accuracy                           0.99     21626
   macro avg       0.99      0.99      0.99     21626
weighted avg       0.99      0.99      0.99     21626

In [ ]:
generalized_classifier_scores["ds_hcc"] = generalized_classifier.test(X_ds_hcc, y_ds_hcc)
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            8721             178
Actual Norm             189            5594
Accuracy: 0.9750034055305816
Classification report:
               precision    recall  f1-score   support

        Hypo       0.98      0.98      0.98      8899
        Norm       0.97      0.97      0.97      5783

    accuracy                           0.98     14682
   macro avg       0.97      0.97      0.97     14682
weighted avg       0.97      0.98      0.97     14682

Final comparisons¶

In [ ]:
def plot_accuracies(
    data_dict: dict[str, dict[str, float]],
    title: str = "Model Accuracies",
    ylabel: str = "Accuracy (%)",
    xlabel = "Models"
):
    colors = plt.cm.tab10.colors
    category_color_map = {}
    entries = []

    # Flatten and store (model_name, accuracy, category)
    for i, (category, models) in enumerate(data_dict.items()):
        category_color_map[category] = colors[i % len(colors)]
        for model_name, acc in models.items():
            acc = acc * 100 if 0 <= acc <= 1 else acc
            entries.append((model_name, acc, category))

    # Sort by accuracy (descending)
    entries.sort(key = lambda x: x[1], reverse = True)

    # Extract values for plotting
    model_labels = [f"{model}" for model, _, _ in entries]
    accuracies = [accuracy for _, accuracy, _ in entries]
    categories = [category for _, _, category in entries]
    bar_colors = [category_color_map[category] for category in categories]
    x_positions = np.arange(len(entries))

    plt.figure(figsize = (max(10, len(entries) * 0.6), 8))

    bars = plt.bar(x_positions, accuracies, color = bar_colors, edgecolor = "black")

    # Set Y-limits before placing text
    min_y = min(accuracies)
    max_y = max(accuracies)
    if max_y - min_y < 5:
        padding = max(1.0, (max_y - min_y) * 0.2)
    else:
        padding = (max_y - min_y) * 0.2
    plt.ylim(min_y - padding, max_y + padding)

    # Add accuracy labels above bars
    for x, acc in zip(x_positions, accuracies):
        label_y = acc + (padding * 0.2)
        plt.text(x, label_y, f"{acc:.1f}%", ha = "center", va = "top", fontsize = 9)

    # X-axis
    plt.xticks(x_positions, model_labels, rotation = 45, ha = "right")
    plt.ylabel(ylabel)
    plt.xlabel(xlabel)
    plt.title(title)
    plt.grid(axis = "y", linestyle = "--", alpha = 0.7)

    # Legend
    handles = [plt.Rectangle((0, 0), 1, 1, color = category_color_map[category]) for category in category_color_map]
    labels = list(category_color_map.keys())
    plt.legend(handles, labels, title = "Category")

    plt.tight_layout()
    plt.show()

Plots¶

In [ ]:
plot_accuracies(
    {
        "PCA": {
            "Logistic Regression": ss_mcf7_pca_logit.accuracy,
            "SVM": ss_mcf7_pca_svm.accuracy,
            "Random Forest": ss_mcf7_pca_random_forest.accuracy,
            "Multilayer Perceptron": ss_mcf7_pca_mlp.accuracy
        },
        "Top PCs": {
            "Logistic Regression": ss_mcf7_top_pcs_logit[1].accuracy,
            "SVM": ss_mcf7_top_pcs_svm[1].accuracy,
            "Random Forest": ss_mcf7_top_pcs_random_forest[1].accuracy,
            "Multilayer Perceptron": ss_mcf7_top_pcs_mlp[1].accuracy
        },
        "Top Genes": {
            "Logistic Regression": ss_mcf7_top_genes_logit[1].accuracy,
            "SVM": ss_mcf7_top_genes_svm[1].accuracy,
            "Random Forest": ss_mcf7_top_genes_random_forest[1].accuracy,
            "Multilayer Perceptron": ss_mcf7_top_genes_mlp[1].accuracy
        },
        "Ensemble": {
            "Simple": ss_mcf7_simple_ensemble.accuracy,
            "Generalized": generalized_classifier_scores["ss_mcf7"]
        }
    },
    "Model Accuracies: SmartSeq MCF7"
)
No description has been provided for this image
In [ ]:
plot_accuracies(
    {
        "PCA": {
            "Logistic Regression": ss_hcc_pca_logit.accuracy,
            "SVM": ss_hcc_pca_svm.accuracy,
            "Random Forest": ss_hcc_pca_random_forest.accuracy,
            "Multilayer Perceptron": ss_hcc_pca_mlp.accuracy
        },
        "Top PCs": {
            "Logistic Regression": ss_hcc_top_pcs_logit[1].accuracy,
            "SVM": ss_hcc_top_pcs_svm[1].accuracy,
            "Random Forest": ss_hcc_top_pcs_random_forest[1].accuracy,
            "Multilayer Perceptron": ss_hcc_top_pcs_mlp[1].accuracy
        },
        "Top Genes": {
            "Logistic Regression": ss_hcc_top_genes_logit[1].accuracy,
            "SVM": ss_hcc_top_genes_svm[1].accuracy,
            "Random Forest": ss_hcc_top_genes_random_forest[1].accuracy,
            "Multilayer Perceptron": ss_hcc_top_genes_mlp[1].accuracy
        },
        "Ensemble": {
            "Simple": ss_hcc_simple_ensemble.accuracy,
            "Generalized": generalized_classifier_scores["ss_hcc"]
        }
    },
    "Model Accuracies: SmartSeq HCC"
)
No description has been provided for this image
In [ ]:
plot_accuracies(
    {
        "PCA": {
            "Logistic Regression": ds_mcf7_pca_logit.accuracy,
            "SVM": ds_mcf7_pca_svm.accuracy,
            "Random Forest": ds_mcf7_pca_random_forest.accuracy,
            "Multilayer Perceptron": ds_mcf7_pca_mlp.accuracy
        },
        "Top PCs": {
            "Logistic Regression": ds_mcf7_top_pcs_logit[1].accuracy,
            "SVM": ds_mcf7_top_pcs_svm[1].accuracy,
            "Random Forest": ds_mcf7_top_pcs_random_forest[1].accuracy,
            "Multilayer Perceptron": ds_mcf7_top_pcs_mlp[1].accuracy
        },
        "Top Genes": {
            "Logistic Regression": ds_mcf7_top_genes_logit[1].accuracy,
            "SVM": ds_mcf7_top_genes_svm[1].accuracy,
            "Random Forest": ds_mcf7_top_genes_random_forest[1].accuracy,
            "Multilayer Perceptron": ds_mcf7_top_genes_mlp[1].accuracy
        },
        "Ensemble": {
            "Simple": ds_mcf7_simple_ensemble.accuracy,
            "Generalized": generalized_classifier_scores["ds_mcf7"]
        }
    },
    "Model Accuracies: DropSeq MCF7"
)
No description has been provided for this image
In [ ]:
plot_accuracies(
    {
        "PCA": {
            "Logistic Regression": ds_hcc_pca_logit.accuracy,
            "SVM": ds_hcc_pca_svm.accuracy,
            "Random Forest": ds_hcc_pca_random_forest.accuracy,
            "Multilayer Perceptron": ds_hcc_pca_mlp.accuracy
        },
        "Top PCs": {
            "Logistic Regression": ds_hcc_top_pcs_logit[1].accuracy,
            "SVM": ds_hcc_top_pcs_svm[1].accuracy,
            "Random Forest": ds_hcc_top_pcs_random_forest[1].accuracy,
            "Multilayer Perceptron": ds_hcc_top_pcs_mlp[1].accuracy
        },
        "Top Genes": {
            "Logistic Regression": ds_hcc_top_genes_logit[1].accuracy,
            "SVM": ds_hcc_top_genes_svm[1].accuracy,
            "Random Forest": ds_hcc_top_genes_random_forest[1].accuracy,
            "Multilayer Perceptron": ds_hcc_top_genes_mlp[1].accuracy
        },
        "Ensemble": {
            "Simple": ds_hcc_simple_ensemble.accuracy,
            "Generalized": generalized_classifier_scores["ds_hcc"]
        }
    },
    "Model Accuracies: DropSeq HCC"
)
No description has been provided for this image

Final Model¶

In all four data sets, the generalized ensemble model performs the best. This model takes the majority vote from the predictions of four simple ensemble models, which themselves take the majority vote of individual classification models which were trained on the top genes identified through feature selection. Logically, this composite ensemble across different models trained on different data sets produces a very robust and generalizable classifier which performs the best. Of the individual models comprising the ensemble, these are their hyperparameters:

In [ ]:
dataset_types = ["SmartSeq MCF7", "SmartSeq HCC", "DropSeq MCF7", "DropSeq HCC"]
for i in range(0, 4):
    print(f"{dataset_types[i]} =========================================================")
    for model in generalized_classifier.models[i].models:
        print(model.model)
    print()
SmartSeq MCF7 =========================================================
LogisticRegression(C=0.01, max_iter=20000, n_jobs=-1, random_state=10)
LinearSVC(C=0.025, max_iter=10000, random_state=10)
RandomForestClassifier(max_depth=5, n_estimators=25, n_jobs=-1, random_state=10)

SmartSeq HCC =========================================================
LogisticRegression(C=0.01, max_iter=20000, n_jobs=-1, random_state=10)
LinearSVC(C=0.025, max_iter=10000, random_state=10)
RandomForestClassifier(max_depth=5, n_estimators=25, n_jobs=-1, random_state=10)

DropSeq MCF7 =========================================================
LogisticRegression(C=1, max_iter=20000, n_jobs=-1, random_state=10,
                   solver='sag')
LinearSVC(C=0.025, max_iter=10000, random_state=10)
RandomForestClassifier(class_weight='balanced', max_depth=30, n_estimators=400,
                       n_jobs=-1, random_state=10)
MLPClassifier(alpha=0.001, early_stopping=True, hidden_layer_sizes=(200,),
              max_iter=500, random_state=10)

DropSeq HCC =========================================================
LogisticRegression(C=1, max_iter=20000, n_jobs=-1, random_state=10,
                   solver='sag')
LinearSVC(C=0.025, max_iter=10000, random_state=10)
RandomForestClassifier(class_weight='balanced', max_depth=20,
                       min_samples_leaf=2, min_samples_split=10,
                       n_estimators=300, n_jobs=-1, random_state=10)
MLPClassifier(early_stopping=True, hidden_layer_sizes=(200,), max_iter=500,
              random_state=10)

Test predictions¶

In [ ]:
ss_mcf7_test = pd.read_csv("AILab2025/SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_test_anonim.txt", delimiter = " ", engine = "python", index_col = 0)
ss_hcc_test = pd.read_csv("AILab2025/SmartSeq/HCC1806_SmartS_Filtered_Normalised_3000_Data_test_anonim.txt", delimiter = " ", engine = "python", index_col = 0)
ds_mcf7_test = pd.read_csv("AILab2025/DropSeq/MCF7_Filtered_Normalised_3000_Data_test_anonim.txt", delimiter = " ", engine = "python", index_col = 0)
ds_hcc_test = pd.read_csv("AILab2025/DropSeq/HCC1806_Filtered_Normalised_3000_Data_test_anonim.txt", delimiter = " ", engine = "python", index_col = 0)
In [ ]:
ss_mcf7_test_predictions = generalized_classifier.predict(ss_mcf7_test.T)
ss_hcc_test_predictions = generalized_classifier.predict(ss_hcc_test.T)
ds_mcf7_test_predictions = generalized_classifier.predict(ds_mcf7_test.T)
ds_hcc_test_predictions = generalized_classifier.predict(ds_hcc_test.T)
In [ ]:
np.savetxt("test_predictions_ss_mcf7.tsv", ss_mcf7_test_predictions, delimiter = "\t", fmt = "%s")
np.savetxt("test_predictions_ss_hcc.tsv", ss_hcc_test_predictions, delimiter = "\t", fmt = "%s")
np.savetxt("test_predictions_ds_mcf7.tsv", ds_mcf7_test_predictions, delimiter = "\t", fmt = "%s")
np.savetxt("test_predictions_ds_hcc.tsv", ds_hcc_test_predictions, delimiter = "\t", fmt = "%s")

Cross-Dataset Evaluation¶

Set-up¶

In [ ]:
# Number of common genes
def common_genes(data1, data2, name1="Dataset 1", name2="Dataset 2"):
    common_genes = data1.index.intersection(data2.index)
    print(f"Number of common genes in {name1} and {name2}: {len(common_genes)}")

common_genes(ss_mcf7_norm, ss_hcc_norm, "ss_mcf7_norm", "ss_hcc_norm")
common_genes(ds_mcf7_norm, ds_hcc_norm, "ds_mcf7_norm", "ds_hcc_norm")
common_genes(ss_mcf7_norm, ds_mcf7_norm, "ss_mcf7_norm", "ds_mcf7_norm")
common_genes(ss_hcc_norm, ds_hcc_norm, "ss_hcc_norm", "ds_hcc_norm")
Number of common genes in ss_mcf7_norm and ss_hcc_norm: 1208
Number of common genes in ds_mcf7_norm and ds_hcc_norm: 834
Number of common genes in ss_mcf7_norm and ds_mcf7_norm: 496
Number of common genes in ss_hcc_norm and ds_hcc_norm: 516
In [ ]:
def train_cross_dataset_classifier(
    df_train: pd.DataFrame,
    df_test: pd.DataFrame,
    model_func: Callable,
    test_func: Callable,
    random_state: int = 42,
    use_pca: bool = False,
    pca_var_threshold: float = 0.95,
    verbose=True
) -> tuple[ClassifierMixin, np.ndarray, np.ndarray, list, list]:
    """
    Trains a classifier on one dataset and evaluates it on another after optional PCA alignment.

    Args:
        df_train (pd.DataFrame): Training gene expression data (genes x cells).
        df_test (pd.DataFrame): Testing gene expression data (genes x cells).
        model_func (Callable): Function like train_svm(X_train, y_train, ...) returning a trained model.
        test_func (Callable): Function to test the models.
        random_state (int): Random seed.
        use_pca (bool): Whether to apply PCA before training.
        pca_var_threshold (float): Variance threshold for PCA if used.

    Returns:
        model: Trained model.
        X_train, X_test: Input data (PCA-transformed or raw).
        y_train, y_test: Labels.
    """
    # Align common genes
    common_genes = df_train.index.intersection(df_test.index)
    df_train_aligned = df_train.loc[common_genes]
    df_test_aligned = df_test.loc[common_genes]

    assert list(df_train_aligned.index) == list(df_test_aligned.index), "Gene alignment failed"

    # Transpose to cells × genes
    X_train = df_train_aligned.T
    X_test = df_test_aligned.T

    # Optionally apply PCA
    # Can the axes of variation in X_train explain the structure of X_test?
    if use_pca:
        pca = PCA(n_components=pca_var_threshold)
        X_train = pca.fit_transform(X_train)  # Learn principal components from training data
        X_test = pca.transform(X_test)  # Apply same components to testing data

    # Extract labels
    y_train = ['Hypo' if 'hypo' in name.lower() else 'Norm' for name in df_train_aligned.columns]
    y_test = ['Hypo' if 'hypo' in name.lower() else 'Norm' for name in df_test_aligned.columns]

    # Train the model
    model = model_func(X_train=X_train, y_train=y_train, random_state=random_state, n_jobs=-1, verbose=verbose)

    # Evaluate the model
    if test_func is not None:
        test_func(model=model, X_test=X_test, y_test=y_test, verbose=verbose)

    return model, X_train, X_test, y_train, y_test
In [ ]:
def select_best_cross_dataset_model(
    df_train: pd.DataFrame,
    df_test: pd.DataFrame,
    train_funcs: list[Callable],
    test_func: Callable,
    use_pca: bool = False,
    pca_var_threshold: float = 0.95,
    random_state: int = 42,
    verbose=False
):
    """
    Trains multiple models cross-dataset and selects the best one by test accuracy.

    Args:
        df_train, df_test: Gene expression dataframes.
        model_pairs: List of (train_func, test_func).
        use_pca: Whether to apply PCA.
        pca_var_threshold: % variance to retain in PCA.
        random_state: Random seed.

    Returns:
        Prints the best model info and returns (model_name, accuracy).
    """
    best_accuracy = -1
    best_result = None
    best_model_info = {}

    for train_func in train_funcs:
        model_name = train_func.__name__.replace("train_", "").replace("_", " ").title()

        # Wrap train_func to suppress training output
        wrapped_train_func = lambda *args, **kwargs: train_func(*args, **{**kwargs, "verbose": False})

        model, X_train, X_test, y_train, y_test = train_cross_dataset_classifier(
            df_train=df_train,
            df_test=df_test,
            model_func=wrapped_train_func,
            test_func=None,  # skip evaluation for now
            use_pca=use_pca,
            pca_var_threshold=pca_var_threshold,
            random_state=random_state,
            verbose=False
        )

        # Compute accuracy manually
        predictions = model.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)

        print(f"{model_name} test accuracy: {accuracy:.4f}")

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_result = (model_name, accuracy)
            best_model_info = {
                "train_func": train_func,
                "test_func": test_func,
                "X_test": X_test,
                "y_test": y_test,
                "model": model
            }

    # Re-run training and testing for best model if verbose=True
    if verbose:
        print(f"\n============ Best Model: {best_result[0]} ============")
        best_model_info["train_func"](
            X_train=X_train,
            y_train=y_train,
            random_state=random_state,
            n_jobs=-1,
            verbose=verbose
        )
        best_model_info["test_func"](
            model=best_model_info["model"],
            X_test=best_model_info["X_test"],
            y_test=best_model_info["y_test"]
        )

    return best_result

Independent of Technology¶

In [ ]:
# Find the highest accuracy model for MCF7, training on non-reduced Drop-seq data and testing on Smart-seq
select_best_cross_dataset_model(ds_mcf7_norm, ss_mcf7_norm, [train_svm, train_logistic_regression], test_model, use_pca=False)
Svm test accuracy: 0.9680
Logistic Regression test accuracy: 0.9800
('Logistic Regression', 0.98)
In [ ]:
# With PCA reduced data
select_best_cross_dataset_model(ds_mcf7_norm, ss_mcf7_norm, [train_svm, train_logistic_regression], test_model, use_pca=True)
Svm test accuracy: 0.9680
Logistic Regression test accuracy: 0.9800
('Logistic Regression', 0.98)

For MCF7, training an SVM and a Logistic Regression model on Drop-seq and testing on Smart-seq, we obtain a test accuracy of 0.968 and 0.980, respectively. This result is plausible because the Drop-seq dataset contains around 20,000 cells, enabling the model to learn a robust and generalizable decision boundary. In contrast, the Smart-seq test set is much smaller (approximately 250 cells). The high accuracy suggests that the transcriptional differences between hypoxic and normoxic states in MCF7 are consistently captured across both technologies, allowing strong cross-platform generalization.

In [ ]:
# Find the highest accuracy model for HCC1806, training on non-reduced Drop-seq data and testing on Smart-seq
select_best_cross_dataset_model(ds_hcc_norm, ss_hcc_norm,
    [train_svm, train_logistic_regression], test_model, use_pca=False)
Svm test accuracy: 0.7912
Logistic Regression test accuracy: 0.7857
('Svm', 0.7912087912087912)
In [ ]:
# With PCA reduced data
select_best_cross_dataset_model(ds_hcc_norm, ss_hcc_norm,
    [train_svm, train_logistic_regression], test_model, use_pca=True)
Svm test accuracy: 0.7198
Logistic Regression test accuracy: 0.6978
('Svm', 0.7197802197802198)
In [ ]:
# Note precision & recall
model, *_ = train_cross_dataset_classifier(
    ds_hcc_norm, ss_hcc_norm,
    model_func=train_svm,
    test_func=test_model,
)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.8780823960759975
C: 0.025
Penalty: l2
Intercept: [-0.15301313]
Max Iterations: 10000
Number of iterations for convergence: 7
Training accuracy: 0.8914997956681651
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo              96               1
Actual Norm              37              48
Accuracy: 0.7912087912087912
Classification report:
               precision    recall  f1-score   support

        Hypo       0.72      0.99      0.83        97
        Norm       0.98      0.56      0.72        85

    accuracy                           0.79       182
   macro avg       0.85      0.78      0.78       182
weighted avg       0.84      0.79      0.78       182

Independent of Cell Line¶

In [ ]:
# Find the highest accuracy model for Smart-seq using non-reduced data
select_best_cross_dataset_model(ss_hcc_norm, ss_mcf7_norm,
    [train_svm, train_logistic_regression, train_random_forest], test_model, use_pca=False)
Svm test accuracy: 0.9200
Logistic Regression test accuracy: 0.9240
Random Forest test accuracy: 0.8880
('Logistic Regression', 0.924)
In [ ]:
# With PCA reduced data
select_best_cross_dataset_model(ss_hcc_norm, ss_mcf7_norm,
    [train_svm, train_logistic_regression, train_random_forest], test_model, use_pca=True)
Svm test accuracy: 0.8280
Logistic Regression test accuracy: 0.9560
Random Forest test accuracy: 0.9920
('Random Forest', 0.992)

We get a near perfect score for the Random Forest classifier on PCA-reduced data (0.992), suggesting that the model was able to extract highly generalizable features from the HCC1806 Smart-seq expression profiles that transfer effectively to MCF7 cells, despite differences in cell type. Interestingly, using raw (non-PCA) data, Random Forest becomes the worst performing model (0.888). This contrast implies that dimensionality reduction was crucial in mitigating overfitting and emphasizing informative variance, especially for tree-based models prone to memorizing noisy patterns in high-dimensional input.

On the other hand, SVM performs slightly better without PCA, which is expected in settings where the signal is linearly separable or preserved across many genes. Since PCA transforms the data into a lower-dimensional space by discarding some variance, SVM may lose access to subtle but discriminative features. This highlights a trade-off: while PCA improves generalization for models sensitive to noise (like Random Forest), it can limit the expressive power of models that benefit from the full feature space when regularized properly.

In [ ]:
# Find the highest accuracy model for Drop-seq using non-reduced data
select_best_cross_dataset_model(ds_hcc_norm, ds_mcf7_norm,
    [train_svm, train_logistic_regression], test_model, use_pca=False)
Svm test accuracy: 0.6995
Logistic Regression test accuracy: 0.7066
('Logistic Regression', 0.7065569222232498)
In [ ]:
# With PCA reduced data
select_best_cross_dataset_model(ds_hcc_norm, ds_mcf7_norm,
    [train_svm, train_logistic_regression, train_random_forest], test_model, use_pca=True)
Svm test accuracy: 0.7030
Logistic Regression test accuracy: 0.7040
Random Forest test accuracy: 0.6643
('Logistic Regression', 0.7039674465920651)
In [ ]:
# Note precision & recall
model, *_ = train_cross_dataset_classifier(
    ds_hcc_norm, ds_mcf7_norm,
    model_func=train_svm,
    test_func=test_model,
)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.9119331344241793
C: 0.025
Penalty: l2
Intercept: [-0.28561926]
Max Iterations: 10000
Number of iterations for convergence: 7
Training accuracy: 0.9298460700177088
========================= Testing =========================
Confusion matrix:
             Predicted Hypo  Predicted Norm
Actual Hypo            7063            1858
Actual Norm            4640            8065
Accuracy: 0.6995283455100342
Classification report:
               precision    recall  f1-score   support

        Hypo       0.60      0.79      0.68      8921
        Norm       0.81      0.63      0.71     12705

    accuracy                           0.70     21626
   macro avg       0.71      0.71      0.70     21626
weighted avg       0.73      0.70      0.70     21626

Summary¶

When training and testing on the same dataset, the models maintain balanced precision and recall, indicating consistent performance in identifying both hypoxic and normoxic cells. However, in cross-dataset evaluations, recall exceeds precision for the hypoxic class, meaning the models effectively identify most hypoxic cells, but at the cost of misclassifying some normoxic cells as hypoxic. Conversely, for normoxic cells, precision is higher than recall, indicating that the models are more conservative in labeling normoxic cells, missing some but rarely misclassifying hypoxic cells as normoxic.